Preface 


This book originates from a collection of notes and solved problems which I created some time ago for 
my personal use. However, I decided recently to improve and expand this collection and put it in the 
public domain for the common benefit. I also attached to the solved problems proposed exercises which 
mostly follow in their style and method of solution the corresponding problems or are based in their 
content and objective on these problems. I also added computer codes (in C++ language)[1] to some of 
these problems for the purpose of calculation, test and simulation (as well as for the sake of demonstration 
and enhancement of the learning process). 


I used illustrations (such as figures and tables) when necessary or appropriate to enhance clarity and 
improve understanding. I also followed in many cases intuitive arguments and methods to make the notes 
and solutions look natural and instinctive (noting that the theory of probability at its basic level is largely 
based on common sense and intuition). As in my previous books, maximum clarity is one of the main 
objectives and criteria in determining the style of writing, presenting and structuring the book as well as 
selecting its items and contents. 


However, the reader should also notice that the book, in most parts, does not go beyond basic 
probability and hence most subjects are presented and treated at their basic level. The reader is advised 
to consult the table of contents and the index to get a feeling for the content and substance of this 
book. 


A rather modest mathematical background knowledge is required for digesting and understanding the 
book (or at least most of its contents). In fact, the book in most parts requires no more than a college 
or secondary school level of general mathematics. So, the intended readers of the book are primarily 
college (or A-level) students as well as junior undergraduate students (e.g. in mathematics or science or 
engineering). 

The book can be used as a text or as a reference for an introductory course on this subject and 
may also be used for general reading in mathematics. The book may also be adopted as a source of 
pedagogical materials which can supplement, for instance, tutorial sessions (e.g. in undergraduate courses 
on mathematics or science). 


An interesting feature of this book is that it is written and designed, in part, to address practical 
calculational issues (e.g. through sample codes and suggested methods of solution) and hence it is especially 
useful to those who are interested in the calculational applications of the probability theory (i.e. applied 
“mathematicians” working in the field of probability). Other practice-oriented features (such as the 
proposed exercises which we indicated above) also make the book more useful from the perspective of practice 
and application of the probability theory and related subjects. 


Taha Sochi 
London, January 2023 


[1] The number of computer codes that accompany this book is 31; 7 of which are for simulation and the rest are for other 
purposes (mainly calculation). These codes are available from my personal website: https://tahasochi.com/blog/. They 
are also available from ResearchGate: https://www.researchgate.net/publication/368242717_ProbabilityCodes. 
Nomenclature 


In the following list, we define the common symbols, notations and abbreviations which are used in the 
book as a quick reference for the reader. 


∀              for all 
{···}          set 
∅, {}          empty (or null) set 
!              factorial 
∩              intersection of sets 
∪              union of sets 
∈              in (or belong to) 
∉              not in 
⊂              subset 
⊄              not subset 
⊃              super set 
⊆              subset in general (if ⊂ is restricted to proper subset) 
⊇              super set in general (if ⊃ is restricted to proper super set) 
A, B, C, ...   sets 
Ā              complement of set A 
AND            logical AND operator 
ℂ              the set of complex numbers 
               number of combinations of m in n with no repetition (binomial coefficient) 
               number of combinations of m in n with repetition 
               multinomial coefficient 
Cor(x, y)      correlation of two random variables x and y 
Cov(x, y)      covariance of two random variables x and y 
Eq., Eqs.      Equation, Equations 
erf            error function 
f              function (density function) 
F              cumulative distribution function 
H              head (of coin) 
iff            if and only if 
nD             n-dimensional (e.g. 1D) 
ℕ              the set of natural numbers (1, 2, 3, ...) 
N_A            number of elements of set A (a subset of the sample space) 
N_S            number of elements of the sample space 
OR             logical (non-exclusive) OR operator 
p, P           probability 
P(A)           probability of event A 
P(A∩B)         probability of “A AND B” 
P(A∪B)         probability of “A OR B” 
P(A|B)         conditional probability (i.e. probability of A given B) 
               number of permutations of m in n with no repetition 
               number of permutations of m in n with repetition 
ℚ              the set of rational numbers 
ℝ              the set of real numbers 
S              sample space 
T              tail (of coin) 
               universal set 
               variance 
x, X           variable (random variable) 
ℤ              the set of integers 
               parameter of the exponential distribution 
Γ              the (complete) gamma function 
               the (incomplete) gamma function 
λ              Poisson parameter 
μ              arithmetic mean (or average or expectation value) 
               geometric mean 
               harmonic mean 
Π              product symbol 
σ              standard deviation 
Σ              summation symbol 

Chapter 1 
Preliminaries 


In this chapter we present and discuss a number of subjects and issues about probability and related 
matters which are generally needed for understanding and appreciating the materials presented in the 
subsequent chapters as well as avoiding potential sources of misunderstanding and confusion. 


1.1 General Remarks 


In this section we present a number of general remarks (related to the conventions, terminology and 

commonly occurring issues in this book) outlined in the following points: 

1. All numbers and variables in this book are real (i.e. not complex or imaginary). 

2. It is common in the theory of probability to use the notions and notations of sets, and in this book 
we follow this tradition. Accordingly, the familiarity of the readers with the basics of set theory 
is a prerequisite. Therefore, we provide in § 2.1 some preliminary materials about set theory for 
completeness and reference. So, those who are familiar with the basics of set theory can skip § 2.1 
(although it is useful to read it as revision and reminder). We also note that the required notation of 
sets is mostly given in the Nomenclature. 

3. In this book we deal with only univariate probability problems and issues although there are a few 
exceptions. 

4. We attach to the solved problems proposed exercises (which we label with PE). These exercises can 
(in most cases) be solved rather easily by following the method or content of solution in the related 
solved problem or consulting the given notes (i.e. in the main text). The purpose of these exercises is 
to reinforce understanding and provide more practicing opportunities to those who are keen to learn 
by practice and action (although they are used occasionally to draw the attention to related issues). 

5. The calculation and simulation codes that accompany this book (as mentioned in the Preface) satisfy no 
more than the bare minimum of requirements for achieving the intended purposes and objectives, and 
hence common programming measures and standard practices in coding are largely and deliberately 
avoided and ignored. In addition to achieving their basic functionality, these codes are meant to 
give an idea about how to calculate or simulate the problems in question and to verify the stated 
results, and hence any extra elaboration or expansion has no benefit and could also be confusing and 
distracting. These codes are written in the C++ programming language and were compiled and run successfully 
on Microsoft Visual C++ 6.0 and Dev-C++ version 4.9.9.2 (as well as other versions) on Windows 
platforms (XP and 8.1). As indicated already, these codes are generally divided (with regard to their 
functionality and objectives) into two main categories: calculation and simulation. Also see the Preface 
and the sections and parts related to simulation and calculation (e.g. § 1.6 and § 2.3.3). 

6. “Random” (as well as similar words like “randomly” or “at random”) which is used frequently in this book 
to characterize probabilistic selection (or sampling) processes means there is no bias in the selection 
process (i.e. the selection is fair) and hence every potential candidate has the same chance of being 
selected in this process. 


1.2 Initial and Ultimate Probabilities 


It is important to note that in the theory of probability we have two types of probability: initial (or 
simple) and ultimate (or composite). In fact, initial probabilities are needed by the probability theory as 
given conditions or inputs for calculating and obtaining ultimate probabilities. For example, in the event 
of throwing a fair die consecutively we have an initial probability of 1/6 for getting each one of the six 
faces, and we have an ultimate probability of 1/36 for getting face “1” in two consecutive trials. In this 
case, the ultimate probability of 1/36 is obtained from the initial probability of 1/6 in the two trials by 
multiplication, i.e. 1/6 × 1/6 = 1/36.[2] 

Now, someone may ask: how do we get these initial probabilities? However, before we answer this 
question we would like to remark that the probability theory is not concerned with these initial probabilities 
because it takes them as given initial conditions (or assumptions or hypotheses or ... etc.). In other 
words, provided we are given these initial probabilities (and provided these initial probabilities meet the 
required conditions set by the postulates of the probability theory), the probability theory determines the 
ultimate probabilities. In fact, we can imagine the probability theory as a machine that takes these initial 
probabilities as an input (or raw material) and it processes them to produce the ultimate probabilities 
as an output (or final product) whose nature and characteristics depend on the demand and conditions 
presumed in the given problem. 

Returning to the question of how we get the initial probabilities, we can say that there are different 
ways for getting them. Moreover, the method of getting them is case-dependent. For example, in some 
cases the initial probabilities are obtained experimentally (e.g. by tossing a coin many times to get the 
probability of getting head and tail for that coin; see for instance Problem 17 of § 3.2), while in other 
cases we may get them by reasoning (or guess or intuition or logic or ... etc.) where we may use rational 
thinking and theoretical argumentation. In fact, in some cases they may even be postulated and the 
results (i.e. the ultimate probabilities obtained from these postulated initial probabilities according to 
the probability theory) are then verified and validated experimentally or theoretically. In other words, 
the initial probabilities may be treated as modeling parameters and conditions that can be estimated, 
adjusted and tuned to get the best results. 

So in brief, in this book (as well as in similar texts in the literature of probability theory) we have no 
interest in how initial probabilities are obtained. Our concern is how to obtain the required ultimate 
probabilities from the given initial probabilities with the help of the theory of probability in conjunction 
with the demand and conditions of the given problem. We should finally note that in many cases the 
initial probabilities are not given explicitly and directly but they are given implicitly and indirectly in the 
form of a count or a hint or something else and hence we need to infer or guess or calculate them from 
the given data and information. 


1.3 Subjective and Objective Probabilities 


“Probability” may be used in two main meanings: subjective probability and objective probability. 
Subjective probability is about the personal judgment and feeling of an observer about a certain event. 
For example, when I say “Tom is probably sick today” or “Mary is probably a good friend of mine” or 
“John is probably a bad guy” it is essentially about my personal judgment and feeling about the health 
of Tom, the friendship of Mary and the character of John. Objective probability, on the other hand, 
is about objective reality when this reality can be in one form or another regardless of the observer (or 
presumably so). In fact, we can distinguish between these two meanings of probability more technically 
and specifically and in more detail by the following points: 

• Subjective probability is about the personal judgment and feeling of an observer towards a definite 
reality, while objective probability is about a dubious reality in itself and regardless of any observer.[3] 

• Objective probability is based on the notion of outcome of a repetitive trial (where the trial or/and 
its repetition could be hypothetical and where the outcome of the individual trials is not unique), while 
subjective probability is not since it deals with a unique and determined reality (rather than repetitive 
trial and non-unique outcome). 

• Objective probability is strictly measurable and quantifiable according to well-known and well-defined 


[2] In more technical terms (which will be discussed and explained later) the initial probabilities are the probabilities given 
to the points of the sample space. However, we should note that the given example is meant to demonstrate and clarify 
the idea of initial and ultimate probabilities rather than reflect the technicalities of this issue. In fact, whether or not 
1/36 is an ultimate probability depends on the modeling of the problem and the setting and selection of its sample space. 

[3] With regard to the conditional probability (and its consequences like the Bayes rule) the reality is dubious considering 
the hypothetical repetitive trials criterion. 
rules[4] (i.e. the rules of probability theory which is the subject of this book),[5] while subjective probability 
is not. In fact, subjective probability is subject to psychological factors and personal considerations which 
are not measurable or quantifiable or predictable, and hence it varies from one person to another. For 
example, in my opinion “Mark is probably a good guy” but this may not be the opinion of my friend. 
Similarly, my (subjective) probability about the sickness of Sarah may be 50% but it could be 30% 
according to the (subjective) probability of my brother. On the other hand, objective probability is subject 
to realistic considerations and factual factors which are independent of any observer (or supposedly so). 
For example, the (objective) probability of getting head when we toss a fair coin is 1/2 with respect to any 
observer because this probability is based on realistic factors (e.g. determined by real life experiments) 
and hence it is independent of the observer and observation. 

• As indicated earlier, objective probability is the subject of the mathematical theory of probability, while 
subjective probability is not. So in brief, probability theory is about objective probability not subjective 
probability. 

In this regard it is useful to quote Feller on this issue (see the first volume of Feller's book in the 
References): The success of the modern mathematical theory of probability is bought at a price: the 
theory is limited to one particular aspect of “chance.” The intuitive notion of probability is connected 
with inductive reasoning and with judgments such as “Paul is probably a happy man,” “Probably this 
book will be a failure,” “Fermat’s conjecture is probably false.” Judgments of this sort are of interest to 
the philosopher and the logician, and they are a legitimate object of a mathematical theory. It must be 
understood, however, that we are concerned not with modes of inductive reasoning but with something 
that might be called physical or statistical probability. In a rough way we may characterize this concept by 
saying that our probabilities do not refer to judgments but to possible outcomes of a conceptual experiment. 
(End of quote noting that italicization is from Feller) 

We should finally note that although we (for the sake of clarity) defined subjective probability in a way 
that makes it look different in subjects and instances from objective probability, subjective probability can 
also be recognized and attributed to subjects and instances that primarily belong to objective probability. 
For example, based on the factual factors the objective probability of getting a tail when tossing a fair coin 
is 1/2. However, someone (due to wrong conviction or poor arithmetic) may believe that this probability 
is 1/3 and hence his (subjective) probability in this case is 1/3. Anyway, it should be obvious that when 
we talk about probability in this book it should be understood to be about objective probability unless it 
is stated or indicated otherwise. 


1.4 Relevant and Irrelevant Factors 


In any particular probability problem there are certain considerations and factors that are relevant to our 
probability investigation and according to which this investigation should be decided and directed. Any 
other consideration or factor should be discarded and ignored in that investigation although it could be 
relevant or important to other investigations. For example, if we are interested in the gender of newborn 
babies then what is relevant in this investigation is being male or female regardless of being (for instance) 
white or black and healthy or unhealthy. Accordingly, the probabilities will be exclusively determined by 
the gender factor with no involvement of color or health or weight or anything else. So, if we study a 
sample of these babies (say the ones born in a given hospital in a given month) then we should classify 
them as males or females rather than (for instance) males or white females or black females. This is 
important especially when determining the sample space and assigning initial (or simple) probabilities to 
the sample points (as will be clarified more later on). So in brief, any factors and considerations that 
occur accidentally in the probabilistic situation under investigation should be excluded and ignored within 
that investigation although they could be relevant and important in another investigation. 


[4] Being subject to well-known and well-defined rules does not rule out the possibility of disagreements and disputes about 
certain aspects and at some occasions (some of which will be indicated or investigated later). 

[5] We may need to extend “the rules of probability theory” to include some rules, requirements and conditions related to 
initial probabilities and how they should be determined. 
1.5 Factors Affecting Selection and Sampling 


In this section we investigate briefly factors and considerations that affect selection and sampling and 
hence they have an impact on the results of counting and probability. These factors and considerations 
occur frequently in the investigations of probability theory (including this book) and hence the reader 
must be aware of them and should understand and appreciate their significance and impact on counting 
and probability. 

We note that despite the fact that counting and probability at their basic level are generally based 
on (and compliant with) natural intuition and common sense, these factors and considerations (as well 
as other factors and considerations) can make the business of counting and probability very tricky and 
messy and can lead to situations and circumstances incompatible with common sense and may even be 
counter-intuitive. So, these factors and considerations should be investigated carefully and understood 
and appreciated very well to avoid confusion and mistakes. 

We should also note that these factors and considerations are generally dictated and determined by the 
nature of the problem in question and hence they are either stated explicitly or indicated implicitly but 
unambiguously. However, sometimes they are not stated or indicated so and hence it is up to us (i.e. 
whoever tackles and solves the problem) to decide about these factors; in which case considerations like 
common sense, convenience, context, circumstances and objectives should be taken into account to decide 
about these factors and make an appropriate choice about them. 

We finally note that although these factors and considerations are (rather) rarely stated explicitly (and 
as such) they are present in most counting and probability problems (usually in the background of these 
problems and within the intended context) and hence when we tackle these problems we should always 
take account of them in our analysis and solution. 


1.5.1 Distinguishability 


One of the main factors that affect counting (and hence probability) is the distinguishability and 
indistinguishability in the selection and sampling process.[6] For example, if we want to choose 3 cars out of 
5 cars then the result of this selection (as determined for instance by how the selected 3 cars will look) 
will be different if the 5 cars are distinguishable from each other (e.g. by having different colors) or not 
distinguishable (e.g. by having the same color). More specifically, if all the 5 cars are black then we have 
only one possibility for this selection with regard to the color (i.e. all the selected 3 cars are black), while 
if the 5 cars are different in color from each other then we have more than one possibility for this selection 
with regard to the color (e.g. black-red-blue, blue-red-green, yellow-green-red, etc.). 

In fact, we usually meet two main types of distinguishability/indistinguishability in the selection process 
that determine the results of counting and probability: 
• Distinguishability of objects, i.e. whether or not the objects in the selection process can be (or are 
wanted to be)[7] distinguished from each other (with regard to the aspect of concern and the property 
of interest in that particular situation and context). An instance of this type of distinguishability is the 
aforementioned cars example. 
• Distinguishability of arrangements, i.e. whether or not the arrangements and configurations of the 
selected objects in the selection process can be (or are wanted to be) distinguished from each other (with 
regard to the aspect of concern and the property of interest in that particular situation and context). For 
example, if we have 5 cars of different colors and we want to select 3 of them (taking into account their 
distinguishability in color) then we either consider the yellow-green-red selection as a single arrangement 
(ignoring the order of the colored cars) or we consider the yellow-green-red selection as one arrangement 


[6] We note that “selection and sampling process” refers to a more general meaning than what it suggests initially and 
primarily. So, it includes for example (beside selection and sampling) things like “expectation from a random experiment” 
and “assignment of states to objects” (among many other things). 

[7] This is to indicate that in some cases and situations the objects can be distinguished but we deliberately ignore their 
distinction and treat them as identical and indistinguishable because their distinction is irrelevant or not under consideration 
or not of importance to us. 
among other similar arrangements involving these colors such as green-red-yellow (considering the order 
of the colored cars). 

So, in any selection process we should be aware of distinguishability/indistinguishability of the involved 
objects or/and the required arrangements, and accordingly we should consider and determine whether or 
not distinguishability is relevant (or possible or wanted or ... etc.) in that situation and problem before 
making any decision about how to manage the situation and tackle the problem. Otherwise, serious 
mistakes could be committed in solving counting and probability problems. 

It is worth noting that the distinguishability of arrangements generally depends on the distinguishability 
of objects (as well as on other factors and considerations which affect their distinguishability such as order 
in space or time or rank). We should also note that when objects and arrangements are classified as 
indistinguishable there is no difference between them being really indistinguishable (because we are not 
able to distinguish between them) and being distinguishable but we do not want to distinguish between 
them (and hence we treat them as indistinguishable). 


1.5.2 Replacement 


When we make a succession of selections (by taking an object or objects from a collection of objects in 
each selection) then we have two possibilities: either we return the selected object(s) to the collection 
before we make the next selection or we do not. The first case is commonly called selection “with 
replacement” and the second case selection “without replacement”. 
It is obvious that the result of the next selection depends on the type of the previous selection(s) and 
whether it is with or without replacement. This is because the collection in the case of replacement is not 
the same as the collection in the case of no replacement since the collection in the former case includes 
the object(s) selected in the previous selection(s). 

For example, suppose we have a box containing a collection of 10 balls, 7 of which are black and the 
remaining 3 are red, and suppose that we made a first selection by taking the 3 red balls out of the box. 
Now, if we return the chosen 3 balls to the box before making the second selection then our collection 
of choice in the second selection will be the original 10 balls (7 black and 3 red) and hence we have an 
opportunity to have red balls in the second selection. However, if we do not return the chosen 3 balls to 
the box before making the second selection then our collection of choice in the second selection will be 
the remaining 7 black balls and hence we do not have an opportunity to have red balls in the second 
selection. 

So, in the problems of counting and probability we should always be aware (and take account) of the type 
of selections that we make and if they are supposed to be with or without replacement so that the choices 
that we make and the probabilities that we estimate are correct and compliant with the conditions and 
requirements of the problem in hand. However, it should be noted that in many cases and situations the 
condition of “with or without replacement” may not be stated explicitly but it can be understood implicitly 
and inferred from the contexts and circumstances. We should also note that replacement usually applies 
to physical objects but not to non-physical (or abstract) objects, e.g. we can replace cards in a deck 
of cards or balls in a collection of balls but not letters in the alphabet or numbers in a set of numbers. 
However, in the case of dealing with non-physical objects repetition/no-repetition could take the role of 
replacement/no-replacement (see § 1.5.3). 


1.5.3 Repetition 


When we make a selection from a collection of objects we have two main cases with regard to the possibility 
of selecting identical objects, i.e. selection of identical objects is either allowed or not. The former case 
is commonly described as selection with repetition while the latter case is described as selection without 
repetition. For example, when we choose letters from the English alphabet to build words and sentences 
we can select any letter more than once, e.g. ‘s’ and ‘o’ are used 2 times each and ‘e’ is used 3 times for 
building the sentence “these are my books” and hence this is an instance of selection with repetition. On 
the other hand, if we have 26 stickers labeled uniquely with the 26 English alphabet letters and we want 
to build a sentence from these stickers then our choice of words and sentences is limited by the condition 
that our selection of letters should be without repetition because no sticker can be used more than once, 
and accordingly we can build the sentence “he is not bad” but not the sentence “he is good”. 


1.5.4 Order 


Another factor that determines the type and nature of selection is the order of the selected objects. So, 
in one case the order of the selected objects is significant and in another case it is not. This factor affects 
the type of selection and the number of possibilities of the available choices. For example, suppose we have a bag 
containing 9 balls numbered from 1 to 9 and we want to draw 3 balls out of this bag. In one case, we take 
3 balls at once and in one go and hence there is no order between the selected balls. So, if the 3 selected 
balls are the ones numbered 2, 5, 9 then we have just one possibility for this selection since 2, 5, 9 and 
5, 2, 9 are the same because order is not a factor in this selection. However, in another case we take the 
3 balls one at a time considering their order in selection and hence the possibility 2, 5, 9 (which means the 
first ball is number 2, the second ball is number 5 and the third ball is number 9) is different from the 
possibility 5, 2, 9 (which means the first ball is number 5, the second ball is number 2 and the third ball 
is number 9). So, it is important to be aware of this factor and its impact on the type of selection and 
the number of available possibilities and choices. 


Problems[8] 


1. Suppose we have a bag containing 10 balls numbered from 1 to 10, and assume that we randomly selected a 
sample of 5 balls from this bag. Identify different types of sampling the 5 balls considering some of 
the factors investigated in this section (i.e. § 1.5). 

Answer: For example, we may consider the factors of replacement, repetition and order (noting that 
“10 balls numbered from 1 to 10” indicates that the balls are distinguishable). Accordingly, we can 
identify (at least) 4 types of sampling in this case, that is: 

(a) We may draw the 5 balls one after another without returning any one of the drawn balls to the bag 
before drawing the next ball. In this case we obviously have order. Moreover, by assumption there is 
no replacement and hence there is no possibility of repetition (e.g. if ball number 3 is drawn in the first 
draw then there is no possibility for this ball to appear again in the remaining 4 draws). Therefore, we 
can label this type of sampling as sampling with order and without replacement/repetition. 
(b) We may draw the 5 balls one after another but we return the drawn ball to the bag before drawing 
the next ball. In this case we obviously have order. Moreover, by assumption there is replacement and 
hence there is a possibility of repetition (e.g. if ball number 3 is drawn in the first draw then there is 
a possibility for this ball to appear again in the remaining 4 draws). Therefore, we can label this type 
of sampling as sampling with order and with replacement/repetition. 

(c) We may draw the 5 balls at once and in one go. In this case there is no order in this type of sampling 
(since the 5 balls are drawn simultaneously). Moreover, there is no possibility of replacement/repetition 
(for the same reason). Therefore, we can label this type of sampling as sampling without order 
and without replacement/repetition. 

(d) We may draw the 5 balls one after another and we return each drawn ball to the bag before drawing the 
next ball although we do not consider the order (e.g. the selection 2, 5, 3, 9, 7 is considered the same as 
7, 2, 9, 3, 5). In this case we obviously have no order (i.e. it is irrelevant) but we have replacement and 
hence there is a possibility of repetition. Therefore, we can label this type of sampling as sampling 
without order and with replacement/repetition. 

PE: Discuss the issue of distinguishability/indistinguishability of objects and arrangements in the
context of this Problem.

2. Repeat Problem 1 assuming this time that the 10 balls are indistinguishable (such as by numbers or 
colors or anything else). 

Answer: As the balls are indistinguishable, order and repetition have no meaning. Therefore, we
can distinguish only between sampling with replacement and sampling without replacement. However,
replacement can distinguish between the two sampling processes but not between the two samples
obtained in these processes. So in brief, we cannot distinguish between the 5-ball samples obtained in
any type of sampling. Also see Problem 39 of § 2.2.
PE: What if 5 of the 10 balls in this Problem are colored black and the other 5 are colored white and
hence indistinguishability is limited to each of these groups of 5?

[8] We note that these Problems belong to § 1.5 as a whole rather than to the present subsection.

3. Explain how replacement affects the independence of successive trials (i.e. whether or not the results 
of the next trials depend on the results of the previous trials). 
Answer: In general and from this perspective, the results of successive trials are independent of each 
other in the case of replacement and dependent in the case of no replacement. For example, if we have 
a pack of 10 cards 5 of which are blue and the other 5 are red and we draw several cards in succession 
then it is obvious that if we replace the card after each draw before making the next draw then our 
chance of getting a red or blue card in each draw is the same and it is independent of the results of the 
previous draws (because the pack is the same in each draw). However, if we do not replace the drawn 
cards then our chance of getting a red or blue card in each draw depends on the results of the previous 
draws, i.e. getting blue/red in the previous draws will increase the chance of getting red/blue in the
next draws (and vice versa).
PE: Why did we say “In general and from this perspective”? 

4. Give an example of a situation in which replacement has negligible effect, i.e. selection with and 
without replacement are virtually identical in effect. 
Answer: When selecting a small sample from a large population (of identical objects or large groups 
of identical objects) the effect of replacement becomes negligible because the nature of the population 
is not affected tangibly by taking the tiny sample. For example, if we draw two balls consecutively 
from a collection of 10000 balls half of which (i.e. 5000) are red and the other half are blue then our 
chance of getting a red (or blue) ball in the second draw is practically the same regardless of replacing 
or not replacing the first ball (and regardless of the color of the first ball). 
PE: Try to set some criteria to determine if replacement has negligible/non-negligible effect when 
taking a tiny sample from a large population. 
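
For the example of Problem 4 the sizes of the probabilities can be written explicitly. In the following lines $R_1$, $B_1$ and $R_2$ (which are our own labels, introduced only for this illustration) stand for "first ball red", "first ball blue" and "second ball red":

```latex
\begin{align*}
P(R_2 \mid R_1) &= \frac{4999}{9999} \approx 0.49995 && \text{(no replacement, first ball red)} \\
P(R_2 \mid B_1) &= \frac{5000}{9999} \approx 0.50005 && \text{(no replacement, first ball blue)} \\
P(R_2) &= \frac{5000}{10000} = 0.5 && \text{(with replacement)}
\end{align*}
```

So the three values differ only in the fifth decimal place, whereas in the pack of 10 cards of Problem 3 the corresponding values (4/9 ≈ 0.444, 5/9 ≈ 0.556 and 0.5) differ already in the first decimal place.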


1.6 Simulation of Probability Problems 


Most of the probability problems can be easily simulated computationally by using random number
generators to make random selections similar to the real life random selections and events. Although
these generators are not really and exactly random, they are very close to being so. This is because the
deterministic aspects and dependencies of these generators are entirely irrelevant to the probabilistic
aspects of the virtual probabilistic experiment[9] and hence they provide a reliable way for imitating real
life random occurrences and events.

In this book, we demonstrated the use of computer simulation in probability (where simulation is
relevant and useful) in a few solved problems using the C++ programming language. Although the codes are
deliberately written and structured in a simple way using basic techniques[10] (and hence they achieve no
more than the minimum of the objectives of the simulated problems), they provide a useful way for the
beginners to learn how to simulate probability problems (as well as providing a simple way for testing and
checking the theoretical results obtained by analytical methods). As indicated in the Preface, the codes
are freely available on the internet.

We encourage those who have interest in using computational methods (whether in science or mathematics
or engineering or something else, and whether in probability or something else) to inspect these
codes. We also encourage them to solve the proposed exercises (i.e. the PE’s) that request writing or
modifying or commenting or flowcharting computer codes that simulate probability problems. In fact,
computer simulation is a very effective (and rather easy and motivating) method in the investigations of
probability problems and hence it should be considered as one of the main tools in these investigations.
Therefore, those who have serious interest in probability should consider developing the skills of simulation
(as well as calculation and computation in general) using programming languages and computer codes.

[9] We may also say: the effect of their deterministic aspect is negligible in the given context and presumed situation.
[10] In fact, one reason for the intentional simplicity of these codes and the use of rather primitive methods of simulation is
to avoid complexities that usually lead to confusion and unnecessary difficulties.


1.7 Probability and Common Sense 


As indicated earlier, probability in its fundamental principles and at its basic level is generally intuitive and 
based on common sense. However, this is not always the case especially at the high levels of probability 
theory and its applications where probability problems and mathematics become complicated and involve 
many elements and considerations some of which at least are strange and far from daily life experiences, 
intuition and common sense. So, we advise the “mathematicians” of the probability theory to use their 
common sense and intuition in tackling and solving probability problems but with caution and to certain 
limits and up to certain levels. Accordingly, if the well established formulations of the probability theory 
and its verified mathematics lead to a result that is against their common sense and intuition then they 
should be prepared to digest it and accept it with no hesitation. In fact, there are many problems and 
results in probability theory (some of which will be met in this book) where the theory fails to comply 
with intuition and common sense (see for example Problem 3 of § 6.1). Anyway, computer simulation 
(see § 1.6) can provide a simple and effective method for verifying the suspected results and getting the 
required confidence or certainty (although simulation is not always applicable for this purpose noting for 
instance that some considerations and subtleties may not lend themselves to simulation). 


1.8 Controversies about Probability 


There are many controversies about probability and its mathematical theory. Most of these controversies 
are historical but others are still going on. So, probability theory and its applications are not as universal
or agreed-upon as calculus or complex analysis for instance.[11] Accordingly, we may have more than
one opinion about the method of solving some probability problems, and the results that we obtain from 
these methods may not be identical. This is due for example to the presence of delicate considerations 
and subtleties in the given problem or the existence of conflicting opinions, definitions and conventions. 

In the following subsections we provide a brief glimpse into some of these controversies and disputes. 
However, before that we should note that although these controversies (or some of them at least) are 
not important to our investigation of the probability theory, awareness of their existence is important 
(at least to avoid potential confusion and misunderstanding in some occasions and circumstances). We 
should also note that the impact of these controversies is mostly minimal. Moreover, they can be settled 
if a systematic, tidy and transparent approach is followed. Therefore, the existence of these controversies 
should not affect the general confidence in the probability theory and its results. 


1.8.1 Controversies about the Definition of Probability 


The controversies about the definition of probability are mostly of historical nature and hence they are 
beyond the scope of this book (which is about the theory of probability not its history). However, the 
reader should be aware of the following: 

• Some of these controversies can have a real impact and tangible consequences on the mathematical
formulation of the theory and hence their effect may not be restricted to the definition. In other words,
they could have a practical impact on the results and applications of the theory and not just on its
theoretical or axiomatic structure.

• Some of these (primarily-historical) controversies can creep to modern day literature and could be
traced in some writings and applications (of authors and researchers of modern or relatively-modern
time). Anyway, we will generally follow the mainstream and commonly-accepted axiomatic approach of
modern day probability theory (and hence the readers should not worry about this issue).


[11] For example, getting the derivative of a function or its definite integral has a unique and undisputed answer in calculus.


1.8.2 Frequentism versus Bayesianism


The difference between these views (or trends or schools) is essentially about the definition and interpretation
of probability where the frequentist adopts the criterion (or concept) of relative frequency (in a
large number of trials) as a basis for the definition and interpretation of probability while the Bayesian
adopts the criterion of relative support of evidence to proposition. In this regard we note the following:

• The probability of frequentism is essentially objective while the probability of Bayesianism is relatively
subjective, i.e. it contains an element of subjectivity (see § 1.3).

• The difference between frequentism and Bayesianism is not only of philosophical or contemplative nature
but it usually leads to differences in the formulation of the probability theory and the results obtained
from each school in its application. In fact, this is one reason why a probability problem may
not have a unique and agreed-upon solution (which we indicated previously).

• Frequentism is generally the dominant trend in probability studies and applications although Bayesianism
has also some staunch adherents.

• There are advantages and disadvantages in both these schools as they both have merits and weaknesses
although frequentism seems to be the stronger (which may explain why it is the dominant trend).

• The more appropriate method to use (i.e. frequentism or Bayesianism) may depend on the problem
in question and its nature. For example, frequentism could be the preferred method in certain branches
of science or types of problems while Bayesianism could be the preferred method in other branches and
types.

• The probability theory that we investigate in this book essentially represents the frequentism view (and
hence we insist on adopting the concept of objective probability which is based on the idea of “outcome of
repetitive trial”). This is not only because of the dominance of the frequentism school in modern times,
but also because of our belief that frequentism is more objective and hence it is more appropriate to
use in sciences in general and in physical sciences in particular (noting that science is the main field of
application for the probability theory).


1.9 Modeling in Probability 


As we will see, the mathematics of probability is generally simple (at least for the level that we deal with 
in this book).[12] In fact, the formal aspects of probability problems are generally trivial and are mainly
achieved by elementary mathematical operations such as addition, subtraction, multiplication, summation 
and basic integration. What is difficult about probability and its theory, however, is the modeling of 
problems and casting them in a formal probabilistic style. In other words, what is difficult is to formulate 
the given probability problem in a formally-recognized way that makes it fit in a known and recognizable 
probability law or rule or pattern so that it can be tackled and solved by applying the formality of that 
law or rule or pattern. So, what the amateur mathematicians of probability theory should concentrate 
on (and be more keen to learn and acquire) is to develop the probability modeling skills, and hence they 
should look to the given solved problems and proposed exercises as being mainly practical instances for 
developing these modeling skills.[13] In fact, this objective (i.e. developing probability modeling skills)
was in our sight when we selected and created most of the solved problems and proposed exercises, and 
this is one reason why some of these problems may look trivial while others look more difficult than they 
should be for the level of this book. We also considered the diversity of these problems and exercises 
partly because of our consideration of the necessity of developing these skills. Accordingly, I advise the 
readers to be more keen about acquiring these modeling skills than about solving individual problems 
in their formal mathematical dimensions (i.e. as if they are calculus or complex analysis problems for 
instance). 


[12] The discussion in this section should extend to subjects related to probability such as counting (which will be investigated
in § 2.2). So, “probability” here is more general than its primary meaning. 

[13] Learning these modeling skills (i.e. how to pose the given problem in a manner that makes it solvable or easily solvable) 
is more of an art than a science. 


Chapter 2 
Mathematical Preliminaries 


In this chapter we present some mathematical preliminaries related mostly to the set theory and the 
methods and rules of counting. These preliminaries are required in the development of the theory of 
probability and its applications which will be investigated in the later chapters. 


2.1 The Basics of Set Theory 


In this section we investigate briefly the basics of set theory which is commonly used (as language, 
conventions, symbolism, etc.) in the presentation and formulation of probability theory. 

“Set” means a collection of objects with a common property[14] irrespective of their order (and hence
the set made of the elements a, b is the same as the set made of the elements b, a). These objects are
described as members or elements of the set. These elements are not repetitive, i.e. each distinct
element is represented once in the set (and hence the set made of the elements a, b is the same as the set
made of the elements a, a, b or a, b, b). If an object a belongs to a set A (i.e. a is a member of A) we write
$a \in A$ and if it does not belong to A we write $a \notin A$. A set is specified mathematically either by listing
its members (inside curly brackets {...} where the objects are separated from each other by commas) or by
giving its description (inside curly brackets {...}) by identifying the common property(s) of its members.
For example, the set A of the integer numbers from 1 to 5 may be stated mathematically as:[15]


A= {1,2,3,4,5} or A = {integers from 1 to 5} (1) 


Two sets are equal iff they have the same elements (and hence the equality sign $=$ and the non-equality
sign $\ne$ should be interpreted accordingly). For example, if A is the set consisting of the numbers 1, 2, 3, 4, 5
and B is the set of positive integers < 6 then we can write A = B. Sets are usually (but not necessarily)
labeled with uppercase letters and their elements with lowercase letters.

The empty set (symbolized by $\emptyset$ or {} and may also be called the null set) is a set that contains no
element. For example, the set of “even prime numbers > 2” is empty because no even number > 2 can be
prime since it is divisible by 2. A universal set is a set that includes all the elements of all the related
sets of concern in the particular situation and context. For example, if we are interested in the vowels of
the English alphabet (irrespective of being symbolized by lowercase or uppercase) then the universal set
in this case and context is U = {a, e, i, o, u} since all vowels and groups of vowels belong to this set. As
indicated, the universal set depends on the case and context (and hence if we shifted our attention to the
letters of the English alphabet then the set {a, e, i, o, u} is not universal anymore).

A set B is a subset of a set A if all the members of B are members of A. For example, if A = {1, 2, 3, 4, 5}
then B = {1, 3, 4} is a subset of A. By convention, the set itself and the empty set are subsets of any
set (also see Problem 2). A proper subset of a set is a subset that is not the same as that set (i.e. the
subset has fewer elements than the set itself). For example, if A = {1, 2, 3, 4, 5} then B = {1, 3, 4} is a
proper subset but C = {2, 4, 3, 1, 5} is not (i.e. it is a subset but not a proper one). The common symbol
for subset is $\subset$ (e.g. $C \subset D$ means C is a subset of D). Other symbols are also used.[16]

[14] The condition “with a common property” is to exclude collection of objects that have no common property and hence
it is not sensible to treat them as a single set since set requires certain common features that characterize its elements
and justify the application of its rules and associated concepts. For example, it is not sensible (in common situations) to
have a set of apples and cars (i.e. as such). In fact, this condition should also be justified by the upcoming methods of
specifying the set (i.e. by giving its description by identifying the properties of its members) because no such specification
can be given unless the members of the set possess some common properties and features.

[15] In fact, each one of listing and description can be in several different forms. For example, we can write the above as:

$A = \{3, 2, 4, 5, 1\}$ or $A = \{n : n \in \mathbb{N},\ 1 \le n \le 5\}$

We also note that listing applies when the number of elements (of countable set) is small or can be written compactly
(e.g. 1, 2, ..., k).

A set that contains exactly n elements is commonly described as a set of size n. Finite set is a set
that has a finite number of elements, while infinite set is a set that has an infinite number of elements.
Countable set is a set whose elements can be put in one-to-one correspondence with the set of natural
numbers or with a subset of it, while uncountable set cannot. Countable set can be finite or infinite
(noting that some may limit the use of this term to infinite). In general, countable sets are described
by discrete variables while uncountable sets are described by continuous variables.

The intersection of two sets, A and B, is the set of elements that belong to both A and B. The symbol
used for intersection is $\cap$. For example, if $A = \{a,b,c,d,e\}$ and $B = \{d,e,f,g,h\}$ then $A \cap B = \{d,e\}$.
Intersection is easily generalized to more than two sets, e.g. $A \cap B \cap C$ means the intersection of the sets
A, B, C which is the set of elements that belong to all these three sets. More generally, the intersection
of n sets ($A_1, \ldots, A_n$) is $\bigcap_{i=1}^{n} A_i$ which is the set of elements that belong to every one of the n sets (noting
that n could be infinite, i.e. $\bigcap_{i=1}^{\infty} A_i$).

The union of two sets, A and B, is the set of elements that belong to A or B or both. The symbol used
for union is $\cup$. For example, if $A = \{a,b,c,d,e\}$ and $B = \{d,e,f,g,h\}$ then $A \cup B = \{a,b,c,d,e,f,g,h\}$.
Union is easily generalized to more than two sets, e.g. $A \cup B \cup C$ means the union of the sets A, B, C
which is the set of elements that belong to any one of these three sets (individually or collectively). More
generally, the union of n sets ($A_1, \ldots, A_n$) is $\bigcup_{i=1}^{n} A_i$ which is the set of elements that belong to at least
one of the n sets (noting that n could be infinite, i.e. $\bigcup_{i=1}^{\infty} A_i$).

The complement of a set A is the set that contains all the elements in the universal set that do
not belong to A. The common symbol used for complement is a bar (or a line) over the symbol of
the set (e.g. the complement of A is $\bar{A}$).[17] For example, if we are interested in the vowels of the
English alphabet then the complement of the set $A = \{a, i\}$ is $\bar{A} = \{e, o, u\}$. It is obvious that the
universal set is made of the union of any one of its subsets and its complement, i.e. if $A \subset U$ then
$U = \{\text{elements of } A \text{ and elements of } \bar{A}\} = A \cup \bar{A}$.

The difference of a set A from a set B (which is symbolized as $A - B$ or $A \setminus B$) is the set of elements that
belong to A but not to B.[18] For example, if $A = \{a,b,c,d,e\}$ and $B = \{d,e,f,g,h\}$ then $A - B = \{a,b,c\}$
while $B - A = \{f,g,h\}$. Two sets are mutually exclusive (or disjoint or incompatible) if their
intersection is empty (i.e. they have no common element). In mathematical terms, A and B are mutually
exclusive sets iff

$A \cap B = \emptyset$ (2)

A number of n sets ($A_1, \ldots, A_n$) are pairwise exclusive (or disjoint) iff $A_i \cap A_j = \emptyset$ ($i, j = 1, \ldots, n$ and
$i \ne j$).[19]

The operations on sets are characterized by certain properties and subject to certain rules and laws
which are outlined in the following list:[20]
1. Rules of identity:
$\emptyset \cap A = \emptyset$ (3)
$\emptyset \cup A = A$ (4)
$A \cap A = A$ (5)
$A \cup A = A$ (6)
$U \cap A = A$ (7)
$U \cup A = U$ (8)
2. Rules of complement:
$\bar{\emptyset} = U$ (9)
$\bar{\bar{A}} = A$ (10)
$A \cap \bar{A} = \emptyset$ (11)
$A \cup \bar{A} = U$ (12)
[illegible in source] (13)
3. Commutativity:
$A \cap B = B \cap A$ (14)
$A \cup B = B \cup A$ (15)
$A - B \ne B - A$ (16)
4. Associativity:
$(A \cap B) \cap C = A \cap (B \cap C) = A \cap B \cap C$ (17)
$(A \cup B) \cup C = A \cup (B \cup C) = A \cup B \cup C$ (18)
5. Distributivity (i.e. of intersection on union and union on intersection):[21]
$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$ (19)
$A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$ (20)
6. De Morgan laws:
$\overline{A \cap B} = \bar{A} \cup \bar{B}$ (21)
$\overline{A \cup B} = \bar{A} \cap \bar{B}$ (22)

[16] The symbol $\subseteq$ may be used for subset in general (i.e. whether proper or not) and hence $\subset$ is restricted to proper subset.
The symbols $\supset$ and $\supseteq$ may also be used (corresponding to $\subset$ and $\subseteq$ respectively) to mean superset (i.e. proper and
general respectively). However, in this book we generally use only $\subset$ without distinction between proper and improper
(or general). If distinction is required in a certain context then we will make the distinction clear.

[17] Other common symbols for complement include prime and superscript c (i.e. $A'$ and $A^c$).

[18] The operation of taking the difference of sets is described as subtraction of sets. The similarity between difference
and complement is obvious and hence the difference of A from B may be called the relative complement of B with
respect to A (and thus complement may be called absolute complement).

[19] The reader is referred to Problem 5 of § 3.3 for further details about this issue.

[20] We note that some of the following properties and rules that involve two sets can be extended and generalized to include
more than two sets (and even to infinitely many sets). Some examples of these extensions and generalizations will be
given later on. We should also note that although the above list is generally representative and inclusive to the main
properties and rules, it is not entirely exhaustive (in fact we will meet examples of some other properties and rules in
the solved Problems).


A partition of a set A is a division (or grouping) of A into non-empty, disjoint and comprehensive
subsets. For example, if A = {0,1,2,3,4,5,6,7} then {{0,1,3}, {2,5,7}, {4,6}} is a partition of A because
the subsets {0,1,3}, {2,5,7} and {4,6} are non-empty, they have no shared elements (i.e. disjoint) and
their union is A (i.e. comprehensive). Also, {{0,3}, {1,4}, {2,5,6,7}} and {{5,6,7}, {0,1,2,3,4}} are two
other partitions of A. The subsets in a partition of a set are commonly called cells (of that partition),
e.g. {5,6,7} is a cell of the partition {{5,6,7}, {0,1,2,3,4}}. It is obvious that the partition of a set is a
set whose elements are sets and hence the partition is a set of sets.

[21] We note that there are other types of distributivity relations in the algebra of sets, e.g. distributivity relations involving
difference of sets like $A \cap (B - C) = (A \cap B) - (A \cap C)$. We also note that distributivity can take place from left or from
right (and hence we have left distributivity and right distributivity). However, most of these relations and properties are
not needed in this book (noting that we will investigate some of them later on; see Problem 20).

Problems


1. State the following in the notation of sets (where A and B are sets representing outcomes of a trial
while α, β, γ are potential outcomes):

(a) α is a member of A but not of B. (b) B is not a subset of A.
(c) The elements of B which are not in A. (d) All possible outcomes are in A or in B.
(e) β is neither in A nor in B. (f) γ belongs to A and B.

Answer:

(a) $\alpha \in (A - B)$. (b) $B \not\subset A$. (c) $B - A$ (or $B \setminus A$).
(d) $A \cup B = U$. (e) $\beta \notin (A \cup B)$. (f) $\gamma \in (A \cap B)$.


PE: Repeat the Problem for the following (where A, B, C and D are sets):
(a) The subset of a subset of A is a subset of A. (b) The complement of the complement of A is A.
(c) No proper subset can be a universal set. (d) a belongs to A or B or C but not to D.
(e, f) The complement of the intersection/union of A and B is the union/intersection of their complements.

2. Identify the status of the empty set as being proper or not proper subset. 
Answer: Following the literature, the empty set is a proper subset of every non-empty set but it is not 
a proper subset of itself (because it is itself). However, in our view this should be seen as a convention 
more than a proven fact or result. 
PE: Do you think the empty set is a real set (i.e. in the same sense as non-empty set) or it is a creation
of mathematicians to fill certain gaps and make some generalizations?

3. The following statements are in the notation of sets (where A and B are sets). Express them in ordinary
language.

(a) $\{x : x \in \mathbb{C},\ x \notin \mathbb{R}\}$. (b) $\{\} \subset U$. (c) $\emptyset \subset A$ ($A \ne \emptyset$).
(d) $\bar{U} = \emptyset$. (e) $\overline{A \cap B} = \bar{A} \cup \bar{B}$. (f) $17 \notin \{\mathbb{N} - \{2, 3, 5, 7, 11, \ldots\}\}$.
Answer:

(a) The set of strictly complex numbers (i.e. non-real complex numbers).
(b) The empty set is a subset of the universal set.
(c) The empty set is a subset of any other set.
(d) The complement of the universal set is the empty set.
(e) The complement of the intersection of A and B is the same as the union of their complements.
(f) The number 17 does not belong to the set of natural non-prime numbers.

PE: Repeat the Problem for the following (where A, B and C are sets): 




(a) B={a: c«eU,x¢ A}. (b) ANBNCES. (c) ANBNC=U. 

(d) AUBUC ZU. (e) ANB=AUB. (f) A={a: « ¢U}. 
4. Give examples of finite and infinite sets. 

Answer: 


Examples of finite set: the set of students in a school, the set of human beings (in all places and times),
the set of atoms in our galaxy, the set of even numbers between 0 and 100, the set of solutions of an
$n^{\text{th}}$ order polynomial.
Examples of infinite set: the set of natural numbers N, the set of real numbers between 0 and 1, the 
set of straight lines passing through a point (in 2D or 3D space), the set of points in a line segment, 
the set of solutions of consistent and dependent set of linear equations. 
PE: Give more examples of finite and infinite sets (six each). 

5. Give examples of countable and uncountable sets. 
Answer: 
Examples of (finite) countable set: the set {1,2,...,99}, the set of living beings on Earth, the set of 
stars in a galaxy, the set of solutions of consistent and independent set of linear equations. 
Examples of (infinite) countable set: the set of integers Z, the set of prime numbers, the set of rational 
numbers. 
Examples of uncountable set: the set of planes parallel to the xy plane, the set of complex numbers in
the origin-centered unit disk, the set of real numbers in the interval [1, 2].
PE: Give more examples of countable and uncountable sets (five each). 


6. State some facts about the empty set $\emptyset$, the universal set U and their complements.

Answer:

• The complement of the empty set is the universal set, i.e. $\bar{\emptyset} = U$.

• The complement of the universal set is the empty set, i.e. $\bar{U} = \emptyset$.

• For any set A we have: $\emptyset \subset A \subset U$.

• If A is a subset of the null set then A is null, i.e. $A \subset \emptyset \Rightarrow A = \emptyset$.

• $U - A = \bar{A}$.

PE: State three more facts about the empty set $\emptyset$, the universal set U and their complements.


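7. State some facts about subsets.

Answer: If A, B, C are any three sets then we have:

• $A \subset A$.

• $\emptyset \subset A \subset U$.

• $(A \cap B) \subset A \subset (A \cup B)$.

• $A = B$ iff $A \subset B$ AND $B \subset A$.

• $A \subset B$ iff $A \cap B = A$.

• $A \subset B$ iff $A \cup B = B$.

• $A \subset B$ iff $\bar{B} \subset \bar{A}$.

• If $A \subset B$ AND $B \subset C$ then $A \subset C$.

• If $A \cap B = \emptyset$ then $A \subset \bar{B}$ (and $B \subset \bar{A}$).

• If $A \subset B$ then $A \cup (B - A) = B$.

PE: State four more facts about subsets.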


8. Use the notation of sets to express the relation between integers $\mathbb{Z}$, complex numbers $\mathbb{C}$, natural
numbers $\mathbb{N}$, real numbers $\mathbb{R}$, and rational numbers $\mathbb{Q}$.

Answer: $\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}$.

PE: Use the notation of sets to express the relation between the sets: primates (P), animals (A), homo
sapiens (H), living beings (L), and mammals (M).


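9. Considering the set of Greek letters, let $A = \{\alpha, \beta, \delta, \rho\}$, $B = \{\alpha, \gamma, \mu, \nu, \omega\}$, $C = \{\{\alpha, \gamma\}, \{\rho\}\}$ and
$D = \{\{\}, \{\delta, \mu\}\}$. Which of the following is true/false:

(a) $\alpha \subset B$. (b) $\delta \in A$. (c) $\emptyset \in D$. (d) $\{\alpha, \gamma\} \subset B$.
(e) $\{\alpha, \gamma\} \in C$. (f) $\omega \notin A$. (g) $B \cap C = \{\alpha, \gamma\}$. (h) $\tau \in A \cup B$.
Answer:

(a) False. (b) True. (c) True. (d) True.
(e) True. (f) True. (g) False. (h) False.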
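PE: Repeat the Problem for the following:

(a) $\alpha \supset B$. (b) $C \cap D = \emptyset$. (c) $A \cup B = \{\alpha\}$. (d) $A - B = \{\beta, \delta, \rho\}$.
(e) $\{\} \subset D$. (f) $\alpha \in (A \cap B)$. (g) $A \cap C = \{\alpha, \rho\}$. (h) $\omega \in (A \cap B)$.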


10. Give mathematical conditions equivalent to $A \subset B$ (i.e. A is a subset of B).
Answer: For example:

$\bar{B} \subset \bar{A}$    $A \cap B = A$    $A \cup B = B$    $A \cap \bar{B} = \emptyset$    $\bar{A} \cup B = U$

PE: Give mathematical conditions equivalent to $A \cap B = U$.

11. Identify the relationships between the following: $\{\}$, $\emptyset$, $\{\emptyset\}$, $\{0\}$, $0$.

Answer: Let us first determine the meaning of each one of these so that we can identify the relationship
between them. $\{\}$ and $\emptyset$ mean the null set, $\{\emptyset\}$ means the set that has a single element which is
the empty set (and hence it is a set of sets), $\{0\}$ means the set that has a single element which is the
number 0, and 0 is the number zero and hence it is not a set. Accordingly:

$\emptyset = \{\}$    $\{\} \in \{\emptyset\}$    $\{\} \subset \{\emptyset\}$    $\{\} \subset \{0\}$    $\emptyset \in \{\emptyset\}$

$\emptyset \subset \{\emptyset\}$    $\emptyset \subset \{0\}$    $\{\emptyset\} \ne \{0\}$    $0 \in \{0\}$

PE: Identify the relationships between the following: $\emptyset$, A, U where A is a non-empty proper subset
of U.

12. Associate intersection, union and complement with logical operators.

Answer: In general, intersection is associated with AND, union is associated with (non-exclusive)
OR, and complement is associated with NOT.

PE: Suggest logical operators corresponding to the following set operations (where A and B are sets
and U is the universal set):

(a) $(A \cup B) - (A \cap B)$. (b) $U - A$. (c) $\overline{A \cup B}$. (d) $\overline{A \cap B}$.
13. Give a simple justification for the commutativity of the operations of intersection and union (i.e. Eqs.
14 and 15).

Answer: Intersection is like logical AND (e.g. $A \cap B$ means something that belongs to A AND to B)
and union is like logical OR (e.g. $A \cup B$ means something that belongs to A OR to B). Now, since the
logical operations of AND and OR are commutative, intersection and union must be commutative.[22]
PE: Give a simple justification for the non-commutativity of the operation of difference of sets (i.e.
Eq. 16).

14. State some facts about complement and difference of sets.

Answer: If A, B, C are any three sets then we have:

• $A - B = A \cap \overline{B}$.

• $(A - B) \cap (B - A) = \emptyset$.

• $(A - B) \cup (B - A) = (A \cup B) - (A \cap B)$.

• $\overline{A} - \overline{B} = B - A$.

• $(A \cap B) - (A \cap C) = A \cap (B - C)$.

PE: Explain and justify in words the stated facts in the answer. Do you see any similarity between
$\overline{A} - \overline{B} = B - A$ and the De Morgan laws (try to compare)?

15. Generalize the distributivity property (i.e. of intersection on union and union on intersection) to any
number of sets.
Answer: Distributivity can be easily generalized to any number of sets by grouping (through employing
associativity) with repeated application of distributivity involving three sets (followed by employing
associativity to get rid of grouping). For example, if A, B, C, D are four sets then:

$$\begin{aligned}
A \cap (B \cup C \cup D) &= A \cap [B \cup (C \cup D)] = (A \cap B) \cup [A \cap (C \cup D)] \\
&= (A \cap B) \cup [(A \cap C) \cup (A \cap D)] = (A \cap B) \cup (A \cap C) \cup (A \cap D)
\end{aligned}$$

and

$$\begin{aligned}
A \cup (B \cap C \cap D) &= A \cup [B \cap (C \cap D)] = (A \cup B) \cap [A \cup (C \cap D)] \\
&= (A \cup B) \cap [(A \cup C) \cap (A \cup D)] = (A \cup B) \cap (A \cup C) \cap (A \cup D)
\end{aligned}$$

This pattern can be easily extended by induction to any number of sets. Accordingly, we can write:

$$A \cap \left( \bigcup_i B_i \right) = \bigcup_i \left( A \cap B_i \right)$$


PE: Justify in words each step of the above generalizations. 
[22] In fact, this is ultimately based on the fact that the relationships represented by intersection and union (as well as AND
and OR) are symmetric.

16. Generalize the De Morgan laws to intersections and unions involving more than two sets.
Answer: The De Morgan laws can be easily generalized to intersections and unions involving more than
two sets by grouping (through employing associativity) with repeated application of the De Morgan
laws involving two sets (followed by employing associativity to get rid of grouping). For example, if
A, B, C are three sets then:

$$\overline{A \cap B \cap C} = \overline{A \cap (B \cap C)} = \overline{A} \cup \overline{(B \cap C)} = \overline{A} \cup (\overline{B} \cup \overline{C}) = \overline{A} \cup \overline{B} \cup \overline{C}$$

and

$$\overline{A \cup B \cup C} = \overline{A \cup (B \cup C)} = \overline{A} \cap \overline{(B \cup C)} = \overline{A} \cap (\overline{B} \cap \overline{C}) = \overline{A} \cap \overline{B} \cap \overline{C}$$

This pattern can be easily extended by induction to any number of sets. Accordingly, we can write:

$$\overline{\bigcap_i A_i} = \bigcup_i \overline{A_i} \qquad \text{and} \qquad \overline{\bigcup_i A_i} = \bigcap_i \overline{A_i}$$

PE: Justify in words each step of the above generalizations. 

17. Prove (or rather justify) the De Morgan law for union involving two sets (see Eq. 22).
Answer: Referring to Eq. 22, $A \cup B$ means the elements in the universal set that belong to A or B
or both, and hence $\overline{A \cup B}$ means the elements in the universal set that belong neither to A nor to B.
Similarly, $\overline{A}$ (and $\overline{B}$) means the elements in the universal set that do not belong to A (and B), and
hence $\overline{A} \cap \overline{B}$ means the elements in the universal set that do not belong to A AND do not belong to
B, i.e. the elements in the universal set that belong neither to A nor to B. Accordingly, the left and
right hand sides of Eq. 22 have the same meaning (i.e. they have the same elements) and hence the
equality is established.
PE: Repeat the Problem for the De Morgan law for intersection involving two sets (see Eq. 21). 

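18. In this Problem we have: $\Upsilon = \{a, b, \ldots, z, A, B, \ldots, Z\}$, $\Lambda = \{a, b, \ldots, z\}$, $\Psi = \{A, B, \ldots, Z\}$, and
$\Omega = \{a, e, i, o, u, A, E, I, O, U\}$.[23] Identify the following:

(a) $\overline{\Lambda}$. (b) $\Lambda \cap \Psi$. (c) $\Lambda \cup \Psi$. (d) $\Lambda - \Psi$. (e) $\Upsilon - \Omega$.

(f) $\overline{\Lambda \cap \Psi}$. (g) $\overline{\Lambda \cup \Psi}$. (h) $\Upsilon - \overline{\Omega}$. (i) $\Omega - \Upsilon$. (j) $\Upsilon - \overline{\Lambda}$.

Answer:

(a) $\Psi$. (b) $\emptyset$. (c) $\Upsilon$. (d) $\Lambda$. (e) $\{a : a \text{ is a consonant}\}$.

(f) $\Upsilon$. (g) $\emptyset$. (h) $\Omega$. (i) $\emptyset$. (j) $\Lambda$.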

PE: Repeat the Problem for the following: 

(a) Y. (b) V—A. (c) A-A. (d) TNQ. (e) V—-Q. 
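19. Find all the partitions of the following sets:

(a) $\{\alpha, \beta, \gamma\}$. (b) $\{\alpha, \beta, \gamma, \delta\}$.

Answer:

(a) We have 5 partitions:

$\{\{\alpha\}, \{\beta\}, \{\gamma\}\}$   $\{\{\alpha, \beta\}, \{\gamma\}\}$   $\{\{\alpha, \gamma\}, \{\beta\}\}$   $\{\{\alpha\}, \{\beta, \gamma\}\}$   $\{\{\alpha, \beta, \gamma\}\}$

(b) We have 15 partitions:

$\{\{\alpha\}, \{\beta\}, \{\gamma\}, \{\delta\}\}$   $\{\{\alpha, \beta\}, \{\gamma\}, \{\delta\}\}$   $\{\{\alpha, \gamma\}, \{\beta\}, \{\delta\}\}$   $\{\{\alpha, \delta\}, \{\beta\}, \{\gamma\}\}$

$\{\{\alpha\}, \{\beta, \gamma\}, \{\delta\}\}$   $\{\{\alpha\}, \{\beta, \delta\}, \{\gamma\}\}$   $\{\{\alpha\}, \{\beta\}, \{\gamma, \delta\}\}$   $\{\{\alpha, \beta\}, \{\gamma, \delta\}\}$

$\{\{\alpha, \gamma\}, \{\beta, \delta\}\}$   $\{\{\alpha, \delta\}, \{\beta, \gamma\}\}$   $\{\{\alpha\}, \{\beta, \gamma, \delta\}\}$   $\{\{\beta\}, \{\alpha, \gamma, \delta\}\}$

$\{\{\gamma\}, \{\alpha, \beta, \delta\}\}$   $\{\{\delta\}, \{\alpha, \beta, \gamma\}\}$   $\{\{\alpha, \beta, \gamma, \delta\}\}$

PE: What is the number of partitions of the set $\{\alpha, \beta, \gamma, \delta, \epsilon\}$? Also find 20 of these partitions.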
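The counts 5 and 15 obtained above can be cross-checked numerically. The following is a minimal sketch (not part of the solution method used above) that counts the partitions of a set of n elements by building what is commonly called the Bell triangle; the function name `count_partitions` and the limit on n are our own assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of partitions of a set with n elements, computed row by row with
// the Bell triangle. We assume n <= 24 so that every entry of the triangle
// fits in an unsigned 64-bit integer.
std::uint64_t count_partitions(unsigned n) {
    assert(n <= 24);
    std::vector<std::uint64_t> row{1};  // row 0 of the triangle
    for (unsigned i = 1; i <= n; ++i) {
        std::vector<std::uint64_t> next(i + 1);
        next[0] = row.back();  // a row starts with the last entry of the previous row
        for (unsigned j = 1; j <= i; ++j)
            next[j] = next[j - 1] + row[j - 1];  // left neighbour + entry above it
        row = next;
    }
    return row.front();  // first entry of row n is the required count
}
```

Calling `count_partitions(3)` and `count_partitions(4)` reproduces the 5 and 15 partitions listed in parts (a) and (b).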
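20. Prove (or justify) the following identities (where A, B and C are sets):

(a) $A - B = A \cap \overline{B}$.
(b) $(A - B) \cap (B - A) = \emptyset$.
(c) $(A - B) \cup (B - A) = (A \cup B) - (A \cap B)$.
(d) $(A \cap B) \cap (A - B) = \emptyset$.
(e) $(A \cap B) \cup (A - B) = A$.
(f) $(A - B) \cup (B - A) \cup (A \cap B) = A \cup B$.
(g) $A \cap (B - C) = (A \cap B) - (A \cap C)$.
(h) $(B - C) \cap A = (B \cap A) - (C \cap A)$.
(i) $A - (B \cup C) = (A - B) \cap (A - C)$.
(j) $(B \cup C) - A = (B - A) \cup (C - A)$.
(k) $A - (B \cap C) = (A - B) \cup (A - C)$.
(l) $(B \cap C) - A = (B - A) \cap (C - A)$.
(m) $(B - C) - A = (B - A) - (C - A)$.
(n) $A - (B - C) = (A - B) \cup (A \cap C)$.
(o) $U - (A \cup B) = \overline{A} \cap \overline{B}$.
(p) $\overline{(B - A)} \cup \overline{(A - B)} = U$.
(q) $A \cap (A \cup B) = A$.
(r) $A \cup (A \cap B) = A$.
(s) $\overline{A} - \overline{B} = B - A$.

[23] We note that in this Problem $\Upsilon$ is the universal set and $U \in \Omega$ is a letter (not the universal set).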


Answer: These identities (or at least some of them) are intuitive. In the following we use sometimes 
verbal arguments and sometimes formal arguments for the purpose of diversity and to show that many 
set relations and identities can be proved (or justified) by simple intuitive arguments rather than by 
formal arguments. We also note that some of these identities can be easily extended and generalized 
to include more sets. 

(a) $A - B$ represents the elements of A which do not belong to B, while $A \cap \overline{B}$ represents the elements
which belong to A and to $\overline{B}$ and hence they belong to A but not to B. So, $A - B$ and $A \cap \overline{B}$ mean
the same thing and hence they are equal.

(b) $A - B$ represents the elements of A which do not belong to B, while $B - A$ represents the elements
of B which do not belong to A, and hence they cannot have any common element.

(c) $A - B$ represents the elements of A which do not belong to B, while $B - A$ represents the elements
of B which do not belong to A, and hence their union $(A - B) \cup (B - A)$ represents the elements of
the union of A and B (which is $A \cup B$) excluding the common elements of A and B (which is $A \cap B$).

(d) The elements of A either belong to B (which are represented by $A \cap B$) or do not belong to B
(which are represented by $A - B$), and hence their intersection must be empty.

(e) The elements of A either belong to B (which are represented by $A \cap B$) or do not belong to B
(which are represented by $A - B$), and hence their union must be the entire A.

(f) The elements of $A \cup B$ must belong to A only (which are represented by $A - B$), or to B only
(which are represented by $B - A$), or to both (which are represented by $A \cap B$), and hence their union
must be the entire $A \cup B$.[24]

(g) The elements of B which are common to A but not to C [i.e. $A \cap (B - C)$] are the same as the
elements of B which are common to A (i.e. $A \cap B$) excluding the elements of A which are common to
C (i.e. $A \cap C$).

We may also show this formally as follows (starting from the right hand side):

$$\begin{aligned}
(A \cap B) - (A \cap C) &= (A \cap B) \cap \overline{(A \cap C)} \\
&= A \cap (B \cap \overline{C}) \qquad \text{(Eq. 17)} \\
&= A \cap (B - C) \qquad \text{(see part a)}
\end{aligned}$$

[24] We note that this argument also shows that $A - B$, $B - A$ and $A \cap B$ are disjoint.


(h) We use the same (verbal) argument as the argument of (g). We may also obtain this formally
from the relation of (g) by using the commutativity of intersection, i.e. by shifting the order of all
intersections in the relation of (g) using Eq. 14.

(i) $A - (B \cup C)$ are the elements of A which belong neither to B nor to C. On the other hand, $A - B$
are the elements of A which do not belong to B (although they can belong to C) and $A - C$ are the
elements of A which do not belong to C (although they can belong to B), and hence by taking their
intersection [i.e. $(A - B) \cap (A - C)$] we select only the elements of A that belong neither to B nor to
C (by excluding the elements of C from $A - B$ and excluding the elements of B from $A - C$) which is
the same as $A - (B \cup C)$.

We may also show this formally as follows:

$$A - (B \cup C) = A \cap \overline{(B \cup C)} = A \cap (\overline{B} \cap \overline{C}) = (A \cap A) \cap (\overline{B} \cap \overline{C}) = (A \cap \overline{B}) \cap (A \cap \overline{C}) = (A - B) \cap (A - C)$$

where the first and last equalities are based on the identity of part (a) while the third equality is based
on Eq. 5.


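(j)

$$(B \cup C) - A = (B \cup C) \cap \overline{A} = (B \cap \overline{A}) \cup (C \cap \overline{A}) = (B - A) \cup (C - A)$$

(k)

$$A - (B \cap C) = A \cap \overline{(B \cap C)} = A \cap (\overline{B} \cup \overline{C}) = (A \cap \overline{B}) \cup (A \cap \overline{C}) = (A - B) \cup (A - C)$$

(l)

$$(B \cap C) - A = (B \cap C) \cap \overline{A} = (B \cap C) \cap (\overline{A} \cap \overline{A}) = (B \cap \overline{A}) \cap (C \cap \overline{A}) = (B - A) \cap (C - A)$$

(m)

$$(B - C) - A = (B - C) \cap \overline{A} = (B \cap \overline{A}) - (C \cap \overline{A}) = (B - A) - (C - A)$$

where the second equality is based on the identity of part (h).

(n)

$$A - (B - C) = A - (B \cap \overline{C}) = (A - B) \cup (A - \overline{C}) = (A - B) \cup (A \cap C)$$

where the second equality is based on the identity of part (k).

(o)

$$U - (A \cup B) = \overline{A \cup B} = \overline{A} \cap \overline{B}$$

(p)

$$\overline{(B - A)} \cup \overline{(A - B)} = \overline{(B - A) \cap (A - B)} = \overline{\emptyset} = U$$

(q) It is obvious that what is common between A and $A \cup B$ is the entire A and hence $A \cap (A \cup B) = A$.

(r) It is obvious that $A \cap B$ are elements of A (i.e. those elements which are elements of B as well)
and A is the entirety of A and hence when we “combine” them (i.e. take their union) we get the entire
A and hence $A \cup (A \cap B) = A$.

(s)

$$\overline{A} - \overline{B} = \overline{A} \cap B = B \cap \overline{A} = B - A$$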


PE: Justify the unjustified steps of all the formal derivations given above. 
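As a complement to the verbal and formal arguments, an alleged identity can also be sanity-checked numerically by trying it on every triple of subsets of a small universal set. The following is a minimal sketch of such a check, not a substitute for a proof; the universe {0, 1, 2, 3} and the names `Mask`, `comp`, `diff` and `holds_for_all` are our own assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

using Mask = std::uint32_t;             // bit i set <=> element i is in the set
constexpr unsigned kSize = 4;           // assumed universe U = {0, 1, 2, 3}
constexpr Mask kU = (1u << kSize) - 1;  // the universal set itself

Mask comp(Mask a) { return ~a & kU; }              // complement relative to U
Mask diff(Mask a, Mask b) { return a & comp(b); }  // A - B, using part (a)

using SetExpr = std::function<Mask(Mask, Mask, Mask)>;

// Returns true if lhs(A, B, C) equals rhs(A, B, C) for every triple of
// subsets A, B, C of U (16 x 16 x 16 triples for the universe above).
bool holds_for_all(const SetExpr& lhs, const SetExpr& rhs) {
    for (Mask a = 0; a <= kU; ++a)
        for (Mask b = 0; b <= kU; ++b)
            for (Mask c = 0; c <= kU; ++c)
                if (lhs(a, b, c) != rhs(a, b, c)) return false;
    return true;
}
```

For example, passing the two sides of part (i) as lambdas, i.e. `diff(a, b | c)` and `diff(a, b) & diff(a, c)`, returns true, while the false "identity" $A - B = B - A$ is rejected.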

21. In Problem 20 we used two methods (or types of argument) to prove or justify the identities (and
relations in general) of set theory, i.e. verbal arguments and formal arguments. Can you propose
another method?

Answer: Another common method is the use of Venn diagrams (see § 2.4). So, we can say we have 
three main methods for proving and justifying the identities of set theory: 

• Verbal argument method which is based on demonstrating the logic or rationale behind the
identity. The advantage of this method is that it shows the rationale of the identity and hence we
acquire through this method an intuitive understanding and appreciation which will be helpful in 
various contexts in which the identity (and its alike) is used. The disadvantage of this method is that 
it may not be sufficiently rigorous. Moreover, it may require excessive explanations and justifications 
which are hard to follow and hence it could become susceptible to errors and delusions. Furthermore, 
this method is only applicable to simple identities because complicated identities are generally non- 
intuitive and hence they are not easy (or not possible) to prove by this method because their rationale 
is not simple to grasp or express verbally. 
• Formal (or analytical) argument method which is based on using formal axioms and previously-
proved identities, and hence the identity in question is proved in an analytical way. The advantage of
this method is being analytical and hence it is rigorous and less susceptible to errors and delusions. 
Moreover, it is normally neat and concise as well as being general (i.e. it is applicable in principle to 
any identity regardless of its complexity or its other attributes). The disadvantage of this method is 
its potential complexity and difficulty (e.g. it requires a well structured and ordered set of axioms and 
previously-proved identities, and such structures may not be available or easy to maintain or track or 
obtain, and this is especially true in the case of casual proving of solitary identities). 
• Venn diagrams method where the left and right hand sides of the identity are represented (usually
in stages) by Venn diagrams which, if the identity is correct, should be identical (otherwise the alleged
identity is false). The advantage of this method is its simplicity (because producing Venn diagrams
is usually trivial) as well as its visual nature which may help in acquiring intuitive understanding and
appreciation of the identity. Moreover, it should be fast and easy to do in most cases.[25] The
disadvantage of this method is that it is only applicable to simple identities because complicated
identities are generally difficult (or almost impossible) to express or demonstrate by Venn diagrams
alone (although Venn diagrams can be used partially in conjunction with the verbal argument
method for instance).
We should finally note that it may be necessary (or convenient or helpful) to combine the above 
methods (and possibly other methods) to prove or justify some identities where various methods are 
used to build various parts of the proof. So in brief, those who do work on set theory should be aware 
of all these possibilities (and possibly other possibilities) when they intend to tackle a problem of this 
type (i.e. a problem in which a proof or justification is required). Also see Problem 4 of § 2.4. 
PE: Try to prove some of the identities of Problem 20 by the method of Venn diagrams (referring for 
this purpose to § 2.4). 

22. Are you aware of a set operation other than those investigated above?[26]
Answer: Yes. For example, the Cartesian multiplication (or Cartesian product) of sets $A_i$ ($i = 1, 2, \ldots, n$)
is defined as the set of all ordered n-tuples $(a_1, a_2, \ldots, a_n)$ where $a_1 \in A_1$, $a_2 \in A_2$, ...,
$a_n \in A_n$.
PE: Form the Cartesian product of the sets $A = \{a, b, c, d, e, f\}$ and $B = \{\alpha, \beta, \gamma, \delta\}$.
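For the case of two sets, the definition above can be sketched in C++ as follows. This is only an illustration: the function name `cartesian_product` and the representation of a set as a `std::vector` of distinct values are our own assumptions.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Cartesian product A x B as a list of ordered pairs (a, b), where a is taken
// from a_set and b from b_set. The elements of each input are assumed distinct.
template <typename T, typename U>
std::vector<std::pair<T, U>> cartesian_product(const std::vector<T>& a_set,
                                               const std::vector<U>& b_set) {
    std::vector<std::pair<T, U>> result;
    result.reserve(a_set.size() * b_set.size());  // |A x B| = |A| x |B|
    for (const T& a : a_set)
        for (const U& b : b_set)
            result.emplace_back(a, b);
    return result;
}
```

For instance, the product of {1, 2} and {x, y, z} has 2 x 3 = 6 ordered pairs, starting with (1, x) and ending with (2, z). Because the pairs are ordered, swapping the two arguments gives a different set of pairs in general.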


The rules of counting are generally based on the fundamental principle of counting[27] which states
that if we have N things (e.g. events or actions or choices or ... etc.) labeled as $O_1, O_2, \ldots, O_N$ where
$O_1$ can occur in $m_1$ different ways, $O_2$ can occur in $m_2$ different ways, ..., and $O_N$ can occur in $m_N$
different ways then the number of possibilities for the occurrence of these things is given by the product
$m_1 \times m_2 \times \cdots \times m_N$. In fact, we have two main cases for the application of this principle; these cases will
be investigated in Problem 1.

[25] In fact, this is true if only manual sketches are required. However, the job of producing Venn diagrams could become
lengthy and excessive if the diagrams are required to be of artistic quality (e.g. to be used in a scientific paper or a book)
and this should require professional or computerized skills and resources and hence it becomes a disadvantage rather
than an advantage.

[26] The purpose of this question is mainly to make the reader aware that there are set operations other than those which
we investigated, and hence the investigated operations represent what are of interest to us in this book.

[27] This principle is also known by similar names such as the fundamental counting rule or the principle of counting or the
rule of multiplication of choices or multiplicative rule of counting.

The factorial of a positive integer is the product of all the positive integers from 1 to that integer. The
factorial of a number n is symbolized as n!. Accordingly:

$$n! = 1 \times 2 \times \cdots \times (n-1) \times n = n \times (n-1) \times \cdots \times 2 \times 1 \tag{23}$$

By convention, the factorial of 0 is 1, i.e. 0! = 1.

A permutation of a set of objects is a particular arrangement of these objects (noting that “arrangement”
means the order matters). In simple words, a permutation is a set with a given order of its
members.[28] For example, ab and ba are two permutations of the set {a, b}.[29] The number of different
permutations of m objects taken from a set of n different objects ($0 \le m \le n$) is given by:

$$P^n_m = \frac{n!}{(n-m)!} = n \times (n-1) \times \cdots \times (n-m+1) \tag{24}$$

An important special case of this formula is $P^n_n = n!$.

A combination is a particular grouping of objects with no consideration of their order.[30] For example,
if we consider the combinations of two objects taken from the set {a, b, c} then ab and ac are two different
combinations but ab and ba represent the same combination.[31] The number of different combinations of
m objects taken from a set of n different objects ($0 \le m \le n$) is given by:

$$C^n_m = \frac{n!}{m!\,(n-m)!} = \frac{n \times (n-1) \times \cdots \times (n-m+1)}{m!} \tag{25}$$

By comparing Eqs. 24 and 25 we conclude:

$$C^n_m = \frac{P^n_m}{m!} \tag{26}$$

It is worth noting that $P^n_m$ is the number of different permutations of m objects selected from a set of n
different objects assuming no repetition (in the selected objects) is allowed. If repetition is allowed then
the number of permutations is:

$$\overline{P}^{\,n}_m = n^m \tag{27}$$

Similarly, $C^n_m$ is the number of different combinations of m objects selected from a set of n different objects
assuming no repetition (in the selected objects) is allowed. If repetition is allowed then the number of
combinations is:

$$\overline{C}^{\,n}_m = C^{n+m-1}_m = \frac{(n+m-1)!}{m!\,(n-1)!} \tag{28}$$

Further clarifications about these issues will follow (see for example Problems 5 and 30).[32]
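The four counting formulae (Eqs. 24, 25, 27 and 28) can be sketched as small C++ functions. This is a minimal illustration under our own assumptions: the function names are hypothetical, and the results are assumed to fit in an unsigned 64-bit integer (no overflow checks are made).

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Eq. 24: n x (n-1) x ... x (n-m+1), ordered selection without repetition.
u64 permutations(unsigned n, unsigned m) {
    assert(m <= n);
    u64 result = 1;
    for (unsigned i = 0; i < m; ++i) result *= (n - i);
    return result;
}

// Eq. 25: n! / (m! (n-m)!), unordered selection without repetition.
// After step i the running value is C(n-m+i, i), so each division is exact.
u64 combinations(unsigned n, unsigned m) {
    assert(m <= n);
    if (m > n - m) m = n - m;  // symmetry: C(n, m) = C(n, n-m)
    u64 result = 1;
    for (unsigned i = 1; i <= m; ++i) result = result * (n - m + i) / i;
    return result;
}

// Eq. 27: n^m, ordered selection with repetition.
u64 permutations_with_repetition(unsigned n, unsigned m) {
    u64 result = 1;
    for (unsigned i = 0; i < m; ++i) result *= n;
    return result;
}

// Eq. 28: C(n+m-1, m), unordered selection with repetition.
u64 combinations_with_repetition(unsigned n, unsigned m) {
    if (m == 0) return 1;  // selecting nothing can be done in one way
    assert(n >= 1);
    return combinations(n + m - 1, m);
}
```

For example, `permutations(7, 3)` gives 210 and `combinations(9, 5)` gives 126, while `combinations_with_repetition(3, 2)` gives 6 (the pairs aa, ab, ac, bb, bc, cc for a set {a, b, c}).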


[28] In more technical terms, a permutation of m objects taken from a set of n objects ($0 \le m \le n$) is a tuple of size m of
that set. So, if A is a set of size n then its m-permutations are the set of all its m-tuples.

[29] Or rather (a, b) and (b, a).

[30] In more technical terms, a combination of m objects taken from a set of n objects ($0 \le m \le n$) is a subset of size m of
that set (and hence order is irrelevant). So, if A is a set of size n then its m-combinations are the set of all its subsets
of size m.

[31] Or rather {a, b} and {a, c}.

[32] It is important to note that in this book we generally use “permutation” and “combination” to refer to those without
repetition (in the selected objects) unless stated otherwise.

Problems
1. Give more details about the two cases for the application of the fundamental principle of counting.
Answer: In one case the numbers of possibilities for the occurrence of these things are independent of
each other and hence we multiply the numbers of possibilities as they are, while in the other case the
numbers of possibilities (or some of them) are dependent on each other (and possibly on the sequence of
occurrence) and hence we multiply the numbers of available possibilities considering this dependency.
For example, if we have to choose a committee of one male and one female from a city council made of
7 males and 10 females then we have 7 choices for the selection of male and 10 choices for the selection
of female (i.e. a total of 10 x 7 = 70 possibilities) and these selections are obviously independent of each
other and of the sequence of selection (because male and female are distinct and hence the number of
male/female candidates is independent of the number of female/male candidates). However, if we have
to choose a president and a vice president from this council (regardless of their gender) then we have
obviously 17 choices for selecting one of them and 16 (not 17) for selecting the other (i.e. a total of
17 x 16 = 272 possibilities) because the selection of one will affect the number of the available choices
for the selection of the other (since no one can be president and vice president at the same time) and
hence the numbers of available choices are dependent on each other.
Note 1: the significance of these cases appears in typical situations indicated by words like “repeti- 
tion” and “replacement” (see § 1.5). In general, if the chosen objects can be repetitive (see for instance 
Problem 10) or are replaced (see for instance Problem 14) then the number of choices are independent 
of each other; otherwise the number of choices are not independent. However, these are not the only 
factors that can determine dependency (in fact there are many other factors that can determine de- 
pendency). 
Note 2: in the second case (i.e. when there is dependency) the order of the occurrence is generally
important, i.e. the number of available choices at each step could depend on the order of occurrence.
For instance, in the above example of a city council (made of 7 males and 10 females) if we have to
choose a president (regardless of gender) and a female vice president then if we start by choosing the
vice president then we have 10 x 16 = 160 possibilities, but if we start by choosing the president then
the number of choices for the vice president is either 10 (if the chosen president is male) or 9 (if the
chosen president is female), and hence we have 7 x 10 + 10 x 9 = 160 possibilities.
Note 3: to avoid unnecessary complications (e.g. related to the difference between the two cases)
we generalize the fundamental principle of counting as follows: if we have N things labeled as
$O_1, O_2, \ldots, O_N$ where $O_1$ can occur in $m_1$ different ways, $O_2$ can occur in $m_2$ different ways (given
the choice of $O_1$), ..., and $O_N$ can occur in $m_N$ different ways (given the choice of $O_1, O_2, \ldots, O_{N-1}$)
then the number of possibilities for the occurrence of these things is given by the product
$m_1 \times m_2 \times \cdots \times m_N$.[33]
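The city council examples of this Problem can be checked by direct enumeration rather than by multiplication. The following is a minimal sketch under a hypothetical labelling of our own (members 0 to males-1 are male and the rest are female); the function names are likewise our own.

```cpp
#include <cassert>

// Counts ordered pairs (president, vice president) of two different members.
unsigned count_president_and_vice(unsigned members) {
    unsigned count = 0;
    for (unsigned president = 0; president < members; ++president)
        for (unsigned vice = 0; vice < members; ++vice)
            if (president != vice) ++count;  // no one can hold both posts
    return count;
}

// Counts ordered pairs (president, vice president) where the vice president
// is female and the two posts are held by different people.
// Assumed labelling: members 0 .. males-1 are male, the rest are female.
unsigned count_president_and_female_vice(unsigned males, unsigned females) {
    const unsigned total = males + females;
    unsigned count = 0;
    for (unsigned president = 0; president < total; ++president)
        for (unsigned vice = males; vice < total; ++vice)
            if (president != vice) ++count;
    return count;
}
```

For the council of 7 males and 10 females, the first function returns 17 x 16 = 272 and the second returns 160, which agrees with 10 x 16 (vice president chosen first) and with 7 x 10 + 10 x 9 (president chosen first, split by the gender of the president).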
PE: Give a number of examples for each one of the two cases for the application of the fundamental 
principle of counting. 

2. Use the fundamental principle of counting to justify: 


(a) The rule of permutation (i.e. Eq. 24). (b) The rule of combination (i.e. Eq. 25). 


Answer: 

(a) The first object (of the m objects) can be selected in n different ways (i.e. the number of n objects),
the second object can be selected in $n - 1$ different ways (i.e. the number of the remaining $n - 1$
objects), ... and finally the m-th object can be selected in $n - (m - 1) = n - m + 1$ different ways. So,
by the fundamental principle of counting we get:

$$P^n_m = n \times (n-1) \times \cdots \times (n-m+1) = \frac{n!}{(n-m)!}$$

(b) The difference between $P^n_m$ and $C^n_m$ is that the order (of the m objects) matters in $P^n_m$ but not in
$C^n_m$. This means that we can obtain $C^n_m$ from $P^n_m$ by dividing $P^n_m$ by the number of permutations of m
objects taken from m objects, i.e. $P^m_m$. Now, from part (a) we have $P^m_m = \frac{m!}{(m-m)!} = m!$ (noting that
0! = 1) and hence:

$$C^n_m = \frac{P^n_m}{P^m_m} = \frac{P^n_m}{m!} = \frac{n!}{m!\,(n-m)!}$$


[33] We note that the order of selection (if there is order) is considered by the condition “given the choice of”. If there is no
order and the choices are not mutually independent then we should consider the dependency of the choice of each object
on the choice(s) of other object(s) when we have such a dependency (which can possibly be partial as well as multiple).

PE: Use the fundamental principle of counting to find and justify the number of possible outcomes
when you: 


(a) Throw a die n times. (b) Throw m coins simultaneously n times. 

(c) Throw m dice simultaneously n times. (d) Throw a die and a coin simultaneously n times. 
3. State the sum rule of counting for a number of pairwise disjoint sets. 

Answer: If $A_1, A_2, \ldots, A_n$ are pairwise disjoint sets then the number of elements of their union is the
sum of the numbers of elements of these sets, that is:

$$N = N_1 + N_2 + \cdots + N_n \qquad (A_i \cap A_j = \emptyset)$$

where N is the number of elements of their union (i.e. $A_1 \cup A_2 \cup \ldots \cup A_n$) and $N_1, N_2, \ldots, N_n$ are the
numbers of elements of these sets (with $i, j \in \{1, 2, \ldots, n\}$ and $i \neq j$).


PE: What change should we introduce on this rule of counting if the sets are not pairwise disjoint? 
Consider in your answer only some simple cases (e.g. 5 pairwise disjoint sets except 2 of them which 
have common elements). 

4. Express the factorial of n in another commonly used mathematical form. 
Answer: If we use the product symbol $\Pi$ then we have:

$$n! = \prod_{k=1}^{n} k$$

PE: Investigate the relationship between the factorial and the gamma function (if you ever heard of 
the gamma function). 

5. Let “permutation” mean “selection with order” and “combination” mean “selection with no order”. Also, 
let “repetition” mean “repetition in the selected objects”. State the formulae for the number of different 
choices that can be made when selecting k objects out of n different objects (considering the factors 
of order and repetition). 

Answer: We have four formulae: 


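$$P^n_k = \frac{n!}{(n-k)!} \qquad \text{(permutation with no repetition)} \tag{29}$$

$$\overline{P}^{\,n}_k = n^k \qquad \text{(permutation with repetition)} \tag{30}$$

$$C^n_k = \frac{n!}{k!\,(n-k)!} \qquad \text{(combination with no repetition)} \tag{31}$$

$$\overline{C}^{\,n}_k = \frac{(n+k-1)!}{k!\,(n-1)!} \qquad \text{(combination with repetition)} \tag{32}$$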


Also see Problem 30. 

PE: Try to identify a different type of repetition, i.e. other than “repetition in the selected objects”. 
6. Find the condition for: 

(a) $P^n_m = C^n_m$. (b) $P^n_m = n^m$. (c) $C^n_m = n^m$.

Answer: We note that $m \le n$.

(a)

$$P^n_m = \frac{n!}{(n-m)!} = \frac{n!}{m!\,(n-m)!} = C^n_m \quad \Longrightarrow \quad 1 = \frac{1}{m!} \quad \Longrightarrow \quad m! = 1$$

i.e. m = 0, 1. So, we have $P^n_0 = C^n_0 = 1$ ($n \ge 0$), and $P^n_1 = C^n_1 = n$ ($n \ge 1$).

(b)

$$n \times \cdots \times (n-m+1) = n \times \cdots \times n$$

i.e. m = 1. So, $P^n_1 = n^1 = n$ ($n \ge 1$). Also noting that $P^n_0 = n^0 = 1$ ($n \ge 1$) we should also have
$P^n_0 = n^0$ ($n \ge 1$). So in brief, $P^n_m = n^m$ for $n \ge 1$ and m = 0, 1.

(c)

$$C^n_m = \frac{n!}{m!\,(n-m)!} = n^m \quad \Longrightarrow \quad \frac{n}{m} \times \cdots \times \frac{(n-m+1)}{1} = n \times \cdots \times n$$

i.e. m = 1. So, $C^n_1 = n^1 = n$ ($n \ge 1$). Also noting that $C^n_0 = n^0 = 1$ ($n \ge 1$) we should also have
$C^n_0 = n^0$ ($n \ge 1$). So in brief, $C^n_m = n^m$ for $n \ge 1$ and m = 0, 1.
PE: Justify all the details of the above arguments. 
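The conclusions of this Problem can also be checked by brute force over a small range of n. The following is only a numerical sanity check (not a proof); the function names and the bound on n are our own assumptions.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

u64 perm(unsigned n, unsigned m) {   // Eq. 24: n x (n-1) x ... x (n-m+1)
    assert(m <= n);
    u64 r = 1;
    for (unsigned i = 0; i < m; ++i) r *= (n - i);
    return r;
}

u64 comb(unsigned n, unsigned m) {   // Eq. 25, computed with exact divisions
    assert(m <= n);
    u64 r = 1;
    for (unsigned i = 1; i <= m; ++i) r = r * (n - m + i) / i;
    return r;
}

u64 power(unsigned n, unsigned m) {  // n^m
    u64 r = 1;
    for (unsigned i = 0; i < m; ++i) r *= n;
    return r;
}

// Returns true if, for every 1 <= n <= max_n and 0 <= m <= n, each of the
// three equalities of the Problem holds exactly when m is 0 or 1.
// max_n is assumed small (say <= 15) so that nothing overflows 64 bits.
bool conditions_hold_only_for_m_0_or_1(unsigned max_n) {
    for (unsigned n = 1; n <= max_n; ++n)
        for (unsigned m = 0; m <= n; ++m) {
            const bool expected = (m <= 1);
            if ((perm(n, m) == comb(n, m)) != expected) return false;
            if ((perm(n, m) == power(n, m)) != expected) return false;
            if ((comb(n, m) == power(n, m)) != expected) return false;
        }
    return true;
}
```

A call such as `conditions_hold_only_for_m_0_or_1(12)` returns true, in agreement with the conditions m = 0, 1 found in parts (a), (b) and (c).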


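7. Find the following numbers of permutations: $P^0_0$, $P^n_0$, $P^n_n$, $P^7_3$, $P^9_2$, $P^{13}_6$.

Answer: We use Eq. 24:

$$P^0_0 = \frac{0!}{(0-0)!} = 1 \qquad P^n_0 = \frac{n!}{(n-0)!} = 1 \qquad P^n_n = \frac{n!}{(n-n)!} = n!$$

$$P^7_3 = \frac{7!}{(7-3)!} = 210 \qquad P^9_2 = \frac{9!}{(9-2)!} = 72 \qquad P^{13}_6 = \frac{13!}{(13-6)!} = 1235520$$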
a (7-3)! 4 (9-2)! & ~ (13 —6)! 
PE: Find the following numbers of permutations: P’_,, P"';, P?’, P3?, P2>, P?°. 


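8. Find the following numbers of combinations: $C^0_0$, $C^n_0$, $C^n_n$, $C^5_2$, $C^8_5$, $C^{11}_4$.

Answer: We use Eq. 25:

$$C^0_0 = \frac{0!}{0!\,(0-0)!} = 1 \qquad C^n_0 = \frac{n!}{0!\,(n-0)!} = 1 \qquad C^n_n = \frac{n!}{n!\,(n-n)!} = 1$$

$$C^5_2 = \frac{5!}{2!\,(5-2)!} = 10 \qquad C^8_5 = \frac{8!}{5!\,(8-5)!} = 56 \qquad C^{11}_4 = \frac{11!}{4!\,(11-4)!} = 330$$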
PE: Find the following numbers of combinations: C™_,, O7t3, CP, C33, CH, C. 


9. Plot $C^n_m$ as a function of n and m for $1 \le n \le 12$ to appreciate how $C^n_m$ varies with n and m.

Answer: See Figure 1. We note that because of the large range of $C^n_m$ (i.e. on the vertical ‘z’ axis)
some details are obscured.

PE: Plot $P^n_m$ as a function of n and m for $1 \le n \le 7$.

10. How many 5-letter strings[34] can be made from the 26 letters of the English alphabet if (a) repetition
is allowed (b) repetition is not allowed?

Answer:

(a) If repetition is allowed then we have 26 possibilities for each one of the 5 letters. Hence, by the
fundamental principle of counting the number of possibilities of 5-letter strings is $26^5 = 11881376$. In
fact, this is just $\overline{P}^{\,26}_5$.

(b) If repetition is not allowed then we have 26 possibilities for the 1st letter, 25 possibilities for the
2nd letter, 24 possibilities for the 3rd letter, 23 possibilities for the 4th letter, and 22 possibilities for
the 5th letter. Hence, by the fundamental principle of counting the number of possibilities of 5-letter
strings is 26 x 25 x 24 x 23 x 22 = 7893600. In fact, this is just $P^{26}_5$.

PE: How many different choices do we have in drawing 2 red balls and 3 blue balls from a bag containing
7 red (numbered) balls and 9 blue (numbered) balls considering the cases of (a) with replacement and
(b) without replacement?


[34] We mean by “strings” arrangements of letters which are not necessarily sensible words.

Figure 1: The plot of Problem 9 of § 2.2.


11. List (in a compact way) all the possibilities of the 5-letter strings in parts (a) and (b) of Problem 10.
Answer: If W symbolizes the set of these possibilities then we have:

(a) $W = \{\alpha\beta\gamma\delta\epsilon : \alpha, \beta, \gamma, \delta, \epsilon \in \{\text{English alphabet}\}\}$.

(b) $W = \{\alpha\beta\gamma\delta\epsilon : \alpha, \beta, \gamma, \delta, \epsilon \in \{\text{English alphabet}\},\ \alpha \neq \beta \neq \gamma \neq \delta \neq \epsilon\}$ where $\neq$ means “different”
or “not the same”.

PE: List (in a compact way) all the possibilities of the choices of the PE of Problem 10.

12. List all the 3-letter strings that can be made from the letters {a,b,c} if (a) repetition is allowed (b)
repetition is not allowed.

Answer:

(a) We have $3^3 = 27$ distinct strings which are:

aaa bbb ccc aab aac aba aca baa caa
bba bbc bab bcb abb cbb cca ccb cac
cbc acc bcc abc acb bac bca cab cba

(b) We have $P^3_3 = 6$ distinct strings which are:

abc acb bac bca cab cba


PE: List all the 3-digit numbers that can be made from the digits {1,2,3} if (a) repetition is allowed
(b) repetition is not allowed.
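Lists like those of Problem 12 can be generated mechanically. The following is a minimal recursive sketch under our own assumptions: the names `build` and `strings_from` are hypothetical, and the letters of the alphabet are assumed to be distinct.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Extends `current` letter by letter and stores every completed string in `out`.
// If allow_repetition is false, a letter already used in `current` is skipped.
void build(const std::string& alphabet, std::size_t length, bool allow_repetition,
           std::string& current, std::vector<std::string>& out) {
    if (current.size() == length) {
        out.push_back(current);
        return;
    }
    for (char ch : alphabet) {
        if (!allow_repetition && current.find(ch) != std::string::npos) continue;
        current.push_back(ch);
        build(alphabet, length, allow_repetition, current, out);
        current.pop_back();  // undo the choice before trying the next letter
    }
}

// All strings of the given length made from the (distinct) letters of `alphabet`.
std::vector<std::string> strings_from(const std::string& alphabet, std::size_t length,
                                      bool allow_repetition) {
    std::vector<std::string> out;
    std::string current;
    build(alphabet, length, allow_repetition, current, out);
    return out;
}
```

For the letters "abc" and length 3 this produces 27 strings when repetition is allowed and 6 strings (from abc to cba) when it is not, in agreement with the counts of Problem 12.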


13. List all the 3-digit (a) permutations (b) combinations of the digits {1,2,3,4}.

Answer:

(a) We have $P^4_3 = 24$ permutations which are:

123 124 132 134 142 143 213 214 231 234 241 243
312 314 321 324 341 342 412 413 421 423 431 432

(b) We have $C^4_3 = 4$ combinations which are:

{1,2,3} {1,2,4} {1,3,4} {2,3,4}
PE: List all the 3-letter (a) permutations (b) combinations of the letters {a,b,c,d}. 
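The relation between the two lists of Problem 13 (each combination gives rise to m! permutations, see Eq. 26) can be sketched with the standard algorithms `std::prev_permutation` and `std::next_permutation`. The function names below and the representation of the items as characters of a string are our own assumptions; the items are assumed distinct.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// All m-element combinations of `items`, each written in the order of `items`.
std::vector<std::string> combinations_of(const std::string& items, std::size_t m) {
    assert(m <= items.size());
    std::vector<int> pick(items.size(), 0);
    std::fill_n(pick.begin(), m, 1);  // start by selecting the first m items
    std::vector<std::string> out;
    do {
        std::string selection;
        for (std::size_t i = 0; i < items.size(); ++i)
            if (pick[i]) selection.push_back(items[i]);
        out.push_back(selection);
    } while (std::prev_permutation(pick.begin(), pick.end()));  // next selection mask
    return out;
}

// All m-element permutations of `items`: every ordering of every combination.
std::vector<std::string> permutations_of(const std::string& items, std::size_t m) {
    std::vector<std::string> out;
    for (std::string c : combinations_of(items, m)) {
        std::sort(c.begin(), c.end());
        do {
            out.push_back(c);
        } while (std::next_permutation(c.begin(), c.end()));
    }
    std::sort(out.begin(), out.end());
    return out;
}
```

With the digits "1234" and m = 3, `combinations_of` yields 123, 124, 134, 234 and `permutations_of` yields the 24 strings from 123 to 432 listed in part (a).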

14. In a gambling game, 5 balls are drawn randomly from an urn containing 9 balls (numbered from 1 to
9). What is the number of possibilities for this game if:

(a) The balls are drawn at once.

(b) The balls are drawn in sequence with replacement (i.e. each drawn ball is returned to the urn
before drawing the next ball).

(c) The balls are drawn in sequence with no replacement.

Answer:

(a) Because the balls are drawn at once there is no order and hence this is a problem of combination.
So, what is required is to find the number of combinations of 5 objects taken from a set of 9 objects,
i.e. $C^9_5 = 126$ possibilities.

(b) We have order (as suggested by “sequence”) with replacement, and hence we have 9 possibilities for
each one of the 5 balls. So, by the fundamental principle of counting we have $9^5 = 59049$ possibilities.

(c) We have order (as suggested by “sequence”) with no replacement, and hence this is a problem of
permutation.[35] So, what is required is to find the number of permutations of 5 objects taken from a
set of 9 objects, i.e. $P^9_5 = 15120$ possibilities.

PE: How will the results of this Problem be affected if the balls are colored: 3 blue, 3 green and 3
red? Justify your answer eloquently.

15. Classify the following as permutation or combination problems:

(a) Using three (different) medicines: one at morning, one at midday and one at night. 

(b) Selecting five members of parliament for a parliamentary committee. 

(c) Selecting a president, a prime minister and a secretary of state (from a ruling party). 

Answer: 

(a) This is a permutation problem because the (chronological) order is important. 

(b) This is a combination problem because there is no indication of significance of order (i.e. the 
members of committee are selected and treated equally). 

(c) This is a permutation problem because the (ranking) order is important. 

PE: Repeat the Problem for the following: 

(a) Selecting a football team for a match from the players of a football club (assuming any player in 
the club can take any role in the match). 

(b) Choosing 5 cars of different models from 20 available models. 

(c) Aligning the students of a class in a queue. 

16. Two women and three men are to be selected from a list of candidates made of 13 women and 17 men.
How many possibilities do we have for this selection?

Answer: The order in the selection of women and in the selection of men is irrelevant and hence we
have $C^{13}_2$ possibilities for the selection of women and $C^{17}_3$ possibilities for the selection of men. So, by
the fundamental principle of counting the number of possibilities for the selection of 2 women and 3
men is the product of $C^{13}_2$ and $C^{17}_3$, i.e. $C^{13}_2 \times C^{17}_3 = 78 \times 680 = 53040$.

PE: How will the result of this Problem change if:

(a) The gender of the selected 5 is irrelevant. 

(b) We have to remove 4 men candidates before selecting the 2 women and 3 men. 

[35] In fact, being a problem of permutation or combination depends on the rules of the game (and this should apply even to
part b). However, we treated this as a problem of permutation (due to the strong suggestion of “sequence”).

17. A passport identification number is made of an uppercase English letter followed by a 7-digit number.
How many distinct passports can be issued if:

(a) The letter is consonant and the digits are not repetitive.

(b) There is no restriction on the letter and the digits but the 7-digit number must start (from left)
with a non-zero digit.
Answer: 

(a) We have 21 possibilities for the letter (because we have 21 consonant letters) and P?° possibilities 
for the 7-digit number (because we have 10 possibilities for the 1° digit, 9 possibilities for the 2”¢ 
digit, ..., and 4 possibilities for the 7’” digit). Hence, by the fundamental principle of counting the 
number of distinct passports is: 


10! 
21 x Pr® = 21 x ——__ = 12700800 
£ (10 — 7)! 
(b) We have 26 possibilities for the letter (because we have 26 letters in the English alphabet), 9
possibilities for the 1st digit (because it cannot be 0) and 10 possibilities for each one of the remaining
6 digits of the 7-digit number (because we have no restriction on them). Hence, by the fundamental
principle of counting the number of distinct passports is:

$$26 \times 9 \times 10^{6} = 234000000$$
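Note: the two counts of this Problem can be reproduced by a minimal C++ sketch. The names `permutations`, `passports_a` and `passports_b` are hypothetical (introduced here only for illustration), and the sketch simply multiplies the numbers of possibilities as in the fundamental principle of counting.

```cpp
#include <cstdint>

// Number of permutations of n objects taken k at a time, i.e. n!/(n-k)!,
// computed as the product n x (n-1) x ... x (n-k+1).
std::uint64_t permutations(unsigned n, unsigned k)
{
    if (k > n) return 0;
    std::uint64_t result = 1;
    for (unsigned i = 0; i < k; ++i)
        result *= (n - i);
    return result;
}

// Part (a): 21 consonants followed by 7 non-repeated digits.
std::uint64_t passports_a()
{
    return 21 * permutations(10, 7);
}

// Part (b): 26 letters, a non-zero leading digit, then 6 unrestricted digits.
std::uint64_t passports_b()
{
    std::uint64_t count = 26 * 9;
    for (int i = 0; i < 6; ++i)
        count *= 10;
    return count;
}
```

Calling `passports_a()` and `passports_b()` should reproduce the two numbers obtained in parts (a) and (b).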


PE: Repeat the Problem for the following cases: 
(a) There is no restriction on the letter and the digits but the 7-digit number cannot be less than a 
million. 
(b) The letter cannot be O or I (to avoid confusion with 0 and 1) and the digits cannot be 0. 
18. The constitution of a (liberal democratic) country states that at least 1/3 of the members of parliament
must be women. If the number of members of parliament is 60, how many possibilities do we have for the
representation of women in this parliament?
Answer: 1/3 of 60 is 20. Hence, the number of women must be between 20 and 60 (inclusive). This
means that we have 41 possibilities, i.e. $60 - 19 = 41$.
PE: Repeat the Problem if the constitution requires that both genders must be represented and the 
representation of women must not be less than 1/4. 
19. A football club consists of 2 goalkeepers and 15 players. How many possibilities do we have for selecting
a team (of 1 goalkeeper and 10 players) for a match?
Answer: There are two possibilities for choosing the goalkeeper and $C^{15}_{10}$ for choosing the 10 players
(noting that the order is irrelevant) and hence by the fundamental principle of counting the number
of possibilities is $2 \times C^{15}_{10} = 2 \times 3003 = 6006$.
PE: What if the club consists of 2 goalkeepers, 4 defenders, 6 midfielders and 5 strikers and we want 
a team of 1 goalkeeper, 3 defenders, 4 midfielders and 3 strikers? 
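20. Mention a well-known use of $C^{n}_{m}$ in algebra.
Answer: $C^{n}_{m}$ appears in the expansion of algebraic expressions like $(x+y)^{n}$ by the binomial theorem,
i.e.

$$(x+y)^{n} = \sum_{k=0}^{n} C^{n}_{k}\, x^{n-k} y^{k} = C^{n}_{0}\, x^{n} y^{0} + C^{n}_{1}\, x^{n-1} y^{1} + \cdots + C^{n}_{k}\, x^{n-k} y^{k} + \cdots + C^{n}_{n-1}\, x^{1} y^{n-1} + C^{n}_{n}\, x^{0} y^{n} \qquad (33)$$

For this reason, $C^{n}_{m}$ is also known as the binomial coefficient.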

Note: a simple mathematical device for calculating the binomial coefficients is the Pascal triangle[36]
(see Figure 2) which is characterized by (and constructed according to) the following:

• The two edges of the triangle are 1’s.

• The $n$th row of the triangle has $n + 1$ entries (where $n = 0, 1, 2, \ldots$).


[36] In fact, the Pascal triangle has uses and benefits other than (and more important than) calculating the binomial
coefficients such as demonstrating their relations and patterns (like symmetries) and deriving mathematical relations and
identities involving these coefficients. We also note that calculating the binomial coefficients by the Pascal triangle is
practical only for small coefficients (and it is generally trivial).

n=0                          1
n=1                        1   1
n=2                      1   2   1
n=3                    1   3   3   1
n=4                  1   4   6   4   1
n=5                1   5  10  10   5   1
n=6              1   6  15  20  15   6   1


Figure 2: The first 7 rows of the Pascal triangle. See Problem 20 of § 2.2. 


• Each internal entry is the sum of the nearest two entries (on its two sides) in the row above it.

• The entries in the $n$th row of the triangle are the binomial coefficients in the expansion of binomial
expressions like $(x + y)^{n}$.

• The triangle is symmetric (by mirror reflection) in the vertical line passing through its top vertex.
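Note: the construction rules listed above (edges equal to 1 and each internal entry equal to the sum of the two nearest entries in the row above) can be sketched in C++ as follows. The function name `pascal_triangle` is an illustrative assumption, and the rows are indexed from $n = 0$ as in Figure 2.

```cpp
#include <cstdint>
#include <vector>

// Build the first `rows` rows (n = 0, 1, ..., rows-1) of the Pascal triangle.
// (Hypothetical helper; a sketch of the construction rules only.)
std::vector<std::vector<std::uint64_t>> pascal_triangle(unsigned rows)
{
    std::vector<std::vector<std::uint64_t>> triangle;
    for (unsigned n = 0; n < rows; ++n) {
        // Row n has n + 1 entries; start with all 1's so the two edges are set.
        std::vector<std::uint64_t> row(n + 1, 1);
        // Each internal entry is the sum of the two nearest entries above it.
        for (unsigned k = 1; k < n; ++k)
            row[k] = triangle[n - 1][k - 1] + triangle[n - 1][k];
        triangle.push_back(row);
    }
    return triangle;
}
```

For example, `pascal_triangle(7)` returns the seven rows shown in Figure 2, and its last row holds the binomial coefficients of $(x + y)^{6}$.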
PE: Investigate the potential use of permutations in the mathematics of group theory, number theory, 
genetics, computer science, cryptography and fractals (as well as other branches and fields of pure and 
applied mathematics and related disciplines). 

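21. Show that for any $n \in \mathbb{N}$ we have (with $a_k$ being constants):

(a) $3^{n} = \sum_{k=0}^{n} a_k 2^{k}$. (b) $5^{n} = \sum_{k=0}^{n} a_k 2^{2k}$.
Answer: We use the binomial theorem (Eq. 33).

(a)
$$3^{n} = (1+2)^{n} = \sum_{k=0}^{n} C^{n}_{k}\, 1^{n-k}\, 2^{k} = \sum_{k=0}^{n} C^{n}_{k}\, 2^{k} = \sum_{k=0}^{n} a_k 2^{k} \qquad (a_k = C^{n}_{k})$$

(b)
$$5^{n} = (1+4)^{n} = \sum_{k=0}^{n} C^{n}_{k}\, 1^{n-k}\, 4^{k} = \sum_{k=0}^{n} C^{n}_{k}\, 4^{k} = \sum_{k=0}^{n} C^{n}_{k}\, 2^{2k} = \sum_{k=0}^{n} a_k 2^{2k} \qquad (a_k = C^{n}_{k})$$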


PE: Show that for any $n \in \mathbb{N}$ we have $9^{n} = \sum_{k=0}^{n} a_k 2^{3k}$. Also give a general formula representing the
pattern seen in the examples of this Problem.

22. Rewrite the Pascal triangle of Figure 2 in terms of the combination symbols (i.e. binomial coefficients).
Answer: See Figure 3. 

PE: Describe the pattern of the combination symbols in Figure 3. Also construct the next 3 rows 
(corresponding to n = 7,8,9) of this triangle. 

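23. Prove the following permutation identities:

(a) $P^{n}_{n} = n!$. (b) $P^{n}_{n-1} = n!$ $(n \geq 1)$. (c) $P^{n}_{n} = P^{n}_{n-1}$.
(d) $P^{n}_{m} = n P^{n-1}_{m-1}$. (e) $P^{n}_{k} / P^{n}_{k-1} = n-k+1$. (f) $P^{n}_{k} + k P^{n}_{k-1} = P^{n+1}_{k}$.
Answer: We use Eq. 24.
(a) We have:
$$P^{n}_{n} = \frac{n!}{(n-n)!} = \frac{n!}{0!} = n!$$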
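n=0    $C^{0}_{0}$
n=1    $C^{1}_{0}$  $C^{1}_{1}$
n=2    $C^{2}_{0}$  $C^{2}_{1}$  $C^{2}_{2}$
n=3    $C^{3}_{0}$  $C^{3}_{1}$  $C^{3}_{2}$  $C^{3}_{3}$
n=4    $C^{4}_{0}$  $C^{4}_{1}$  $C^{4}_{2}$  $C^{4}_{3}$  $C^{4}_{4}$
n=5    $C^{5}_{0}$  $C^{5}_{1}$  $C^{5}_{2}$  $C^{5}_{3}$  $C^{5}_{4}$  $C^{5}_{5}$
n=6    $C^{6}_{0}$  $C^{6}_{1}$  $C^{6}_{2}$  $C^{6}_{3}$  $C^{6}_{4}$  $C^{6}_{5}$  $C^{6}_{6}$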


Figure 3: The first 7 rows of the Pascal triangle in terms of the combination symbols. See Problem 22 of 
§ 2.2. 


(b) We have:
$$P^{n}_{n-1} = \frac{n!}{(n-[n-1])!} = \frac{n!}{1!} = n!$$

(c) This can be obtained by combining the results of parts (a) and (b).


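(d) We have:
$$n P^{n-1}_{m-1} = n \times \frac{(n-1)!}{([n-1]-[m-1])!} = \frac{n!}{(n-m)!} = P^{n}_{m}$$
(e) We have:
$$\frac{P^{n}_{k}}{P^{n}_{k-1}} = \frac{n!}{(n-k)!} \times \frac{(n-k+1)!}{n!} = \frac{(n-k+1)!}{(n-k)!} = n-k+1$$
(f) We have:
$$P^{n}_{k} + k P^{n}_{k-1} \overset{?}{=} P^{n+1}_{k}$$
$$\frac{n!}{(n-k)!} + k\,\frac{n!}{(n-k+1)!} \overset{?}{=} \frac{(n+1)!}{(n+1-k)!} \qquad \text{(Eq. 24)}$$
$$\frac{n!}{(n-k+1)!}\left[\frac{(n-k+1)!}{(n-k)!} + k\right] = \frac{(n+1)!}{(n+1-k)!} \qquad \text{(factorizing)}$$
$$\frac{(n-k+1)!}{(n-k)!} + k = n+1 \qquad \text{(canceling)}$$
$$(n-k+1) + k = n+1 \qquad \text{(simplifying)}$$
$$n+1 = n+1$$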


PE: Prove part (f) by another method. 
24. Find the unknown $x$ in the following (where $x$ is a positive or non-negative integer):

(a) $P^{n}_{x} = x!$. (b) $P^{n}_{x-1} = n!$. (c) $6P^{x}_{x-3} = n!$. (d) $3P^{x}_{5} = P^{x}_{6}$.
Answer: 
(a) We have:
$$P^{n}_{x} = x!$$
$$n \times (n-1) \times \cdots \times (n-x+1) = x \times (x-1) \times \cdots \times 1$$
So, by comparison we get $x = n$.


(b) We have:
$$P^{n}_{x-1} = n!$$
$$\frac{n!}{(n-x+1)!} = \frac{n!}{1}$$
So, by comparing the denominators we have either $n - x + 1 = 0$ and hence $x = n + 1$ or $n - x + 1 = 1$
and hence $x = n$.


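(c) We have:
$$6P^{x}_{x-3} = n!$$
$$P^{x}_{x-3} = \frac{n!}{3!}$$
$$x \times (x-1) \times \cdots \times 4 = n \times (n-1) \times \cdots \times 4$$
So, by comparison we get $x = n$ $(n \geq 3)$.
(d) We have:
$$3P^{x}_{5} = P^{x}_{6}$$
$$3\,[x(x-1)(x-2)(x-3)(x-4)] = x(x-1)(x-2)(x-3)(x-4)(x-5)$$
$$3 = (x-5) \qquad (x \neq 0, 1, 2, 3, 4 \text{ since } x \geq 6)$$
$$x = 8$$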


PE: Form and solve two other relations (involving permutations with an unknown) similar to the 
relations given in this Problem. 


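25. Prove the following combination identities:
(a) $C^{n}_{n-m} = C^{n}_{m}$. (b) $C^{n}_{m-1} + C^{n}_{m} = C^{n+1}_{m}$. (c) $C^{n-1}_{m-1} + C^{n-1}_{m} = C^{n}_{m}$.
(d) $C^{n}_{m} + C^{n}_{m+1} = C^{n+1}_{m+1}$. (e) $C^{n}_{m} = \frac{n}{m}\, C^{n-1}_{m-1}$. (f) $C^{n}_{m} C^{m}_{k} = C^{n}_{k} C^{n-k}_{m-k}$.
(g) $C^{m+n}_{q} = \sum_{p=0}^{q} C^{m}_{p} C^{n}_{q-p}$. (h) $\sum_{i=m}^{n} C^{i}_{m} = C^{n+1}_{m+1}$. (i) $C^{2n}_{n} = \sum_{p=0}^{n} \left(C^{n}_{p}\right)^{2}$.
(j) $\sum_{k=0}^{n} C^{n}_{k} = 2^{n}$. (k) $\sum_{p=0}^{m} C^{n+p}_{p} = C^{n+m+1}_{m}$. (l) $C^{2n}_{2} = 2C^{n}_{2} + n^{2}$.
(m) $C^{3n}_{2} = 3C^{n}_{2} + 3n^{2}$. (n) $C^{-n}_{m} = (-1)^{m} C^{n+m-1}_{m}$. (o) $\sum_{r=0}^{n-1} C^{k+r}_{k} = C^{k+n}_{k+1}$.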
Answer: 
(a) From Eq. 25 we have:[37]
$$C^{n}_{n-m} = \frac{n!}{(n-m)!\,(n-[n-m])!} = \frac{n!}{(n-m)!\,m!} = \frac{n!}{m!\,(n-m)!} = C^{n}_{m}$$


(b) This is the Pascal identity which produces the interior entries of the $(n+1)$th row of the Pascal
triangle (see Problems 20 and 22) from the entries of the $n$th row of the triangle.[38] From Eq. 25 we
have:

$$C^{n}_{m-1} + C^{n}_{m} = \frac{n!}{(m-1)!\,(n-m+1)!} + \frac{n!}{m!\,(n-m)!}$$
$$= \frac{m \times n!}{m!\,(n-m+1)!} + \frac{(n-m+1) \times n!}{m!\,(n-m+1)!}$$


[37] This equation is intuitive because for each combination of m objects (chosen from n objects) there is a corresponding
combination of n − m objects (left out of the combination of m objects) and hence the number of the two combinations
must be equal, i.e. $C^{n}_{n-m} = C^{n}_{m}$.

[38] In fact, this identity can be inferred from Figures 2 and 3 (noting that the triangle is actually constructed using this
identity).
$$= \frac{[m \times n!] + [(n-m+1) \times n!]}{m!\,(n-m+1)!}$$
$$= \frac{n!\,[m + (n-m+1)]}{m!\,(n-m+1)!}$$
$$= \frac{n!\,(n+1)}{m!\,(n-m+1)!}$$
$$= \frac{(n+1)!}{m!\,(n+1-m)!} = C^{n+1}_{m}$$


(c) This is the same as the identity of part b (i.e. Pascal identity) with n — 1 replacing n. 
(d) This is the same as the identity of part b (i.e. Pascal identity) with m+ 1 replacing m. 
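(e) We have:
$$\frac{n}{m}\, C^{n-1}_{m-1} = \frac{n}{m} \times \frac{(n-1)!}{(m-1)!\,(n-1-m+1)!} = \frac{n!}{m!\,(n-m)!} = C^{n}_{m}$$
(f) We have:
$$C^{n}_{m} C^{m}_{k} = \frac{n!}{m!\,(n-m)!} \times \frac{m!}{k!\,(m-k)!} = \frac{n!}{(n-m)!} \times \frac{1}{k!\,(m-k)!}$$
$$= \frac{n!}{k!\,(n-k)!} \times \frac{(n-k)!}{(m-k)!\,(n-m)!} = \frac{n!}{k!\,(n-k)!} \times \frac{(n-k)!}{(m-k)!\,(n-k-m+k)!} = C^{n}_{k} C^{n-k}_{m-k}$$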


(g) This is the Vandermonde identity. Let us assume that we have m members of a given club and
n non-members and we want to form a team of q individuals from these m + n individuals. It is
obvious that we have $C^{m+n}_{q}$ ways for the formation of this team (which is the left hand side of this
identity). However, we can look at the formation of this team differently by taking p individuals from
the members (which can be done in $C^{m}_{p}$ different ways) and q − p individuals from the non-members
(which can be done in $C^{n}_{q-p}$ different ways) and hence we have $C^{m}_{p} C^{n}_{q-p}$ ways for the formation of
this team. Now, since p can vary between 0 and q (inclusive) then the total number of ways for the
formation of this team is $\sum_{p=0}^{q} C^{m}_{p} C^{n}_{q-p}$ (which is the right hand side of this identity). So, the left
hand side and the right hand side of this identity are equal (as required).
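Note: the counting argument of part (g) can be checked numerically for specific values by a small C++ sketch. The names `combinations` and `vandermonde_sum` are hypothetical, and we assume the usual convention that the number of combinations is zero when more objects are requested than are available (so that terms with p > m or q − p > n do not contribute).

```cpp
#include <cstdint>

// Number of combinations of n objects taken m at a time (zero if m > n).
std::uint64_t combinations(unsigned n, unsigned m)
{
    if (m > n) return 0;
    if (m > n - m) m = n - m;
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= m; ++i)
        result = result * (n - m + i) / i;   // exact at every step
    return result;
}

// Right hand side of the Vandermonde identity: take p individuals from the
// m members and q - p individuals from the n non-members, for p = 0, ..., q.
std::uint64_t vandermonde_sum(unsigned m, unsigned n, unsigned q)
{
    std::uint64_t total = 0;
    for (unsigned p = 0; p <= q; ++p)
        total += combinations(m, p) * combinations(n, q - p);
    return total;
}
```

For instance, with m = 5, n = 7 and q = 4 the value of `vandermonde_sum(5, 7, 4)` agrees with `combinations(12, 4)`. Such a check is of course an illustration for particular values and not a proof.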

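(h) This is the hockey-stick identity. We have:

$$\sum_{i=m}^{n} C^{i}_{m} = C^{m}_{m} + C^{m+1}_{m} + C^{m+2}_{m} + \cdots + C^{m+k}_{m} \qquad (m+k=n)$$
$$= 1 + C^{m+1}_{m} + C^{m+2}_{m} + \cdots + C^{m+k}_{m} \qquad (C^{m}_{m} = 1)$$
$$= C^{m+1}_{m+1} + C^{m+1}_{m} + C^{m+2}_{m} + \cdots + C^{m+k}_{m} \qquad (C^{m+1}_{m+1} = 1)$$
$$= C^{m+2}_{m+1} + C^{m+2}_{m} + \cdots + C^{m+k}_{m} \qquad \text{(Pascal identity)}$$
$$= C^{m+3}_{m+1} + \cdots + C^{m+k}_{m} \qquad \text{(Pascal identity)}$$
$$\vdots$$
$$= C^{m+k}_{m+1} + C^{m+k}_{m}$$
$$= C^{m+k+1}_{m+1} \qquad \text{(Pascal identity)}$$
$$= C^{n+1}_{m+1} \qquad (m+k=n)$$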
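(i) We have:
$$C^{2n}_{n} = C^{n+n}_{n}$$
$$= \sum_{p=0}^{n} C^{n}_{p}\, C^{n}_{n-p} \qquad \text{(Vandermonde identity with } m = q = n\text{)}$$
$$= \sum_{p=0}^{n} C^{n}_{p}\, C^{n}_{p} = \sum_{p=0}^{n} \left(C^{n}_{p}\right)^{2} \qquad \text{(see part a)}$$

So, this is a special case of the Vandermonde identity with $m = q = n$.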
(j) From Eq. 33 with $x = y = 1$ we have:

$$2^{n} = (1+1)^{n} = \sum_{k=0}^{n} C^{n}_{k}\, 1^{n-k}\, 1^{k} = \sum_{k=0}^{n} C^{n}_{k}$$


(k) Let us prove this by induction. For $m = 1$ we have:

$$\sum_{p=0}^{1} C^{n+p}_{p} = C^{n}_{0} + C^{n+1}_{1} = 1 + (n+1) = n+1+1 = C^{n+2}_{1} = C^{n+m+1}_{m}$$

So, it is true for $m = 1$. Now, let us assume that it is true for $m$ and hence we must check if it is true for
$m + 1$, that is:

$$\sum_{p=0}^{m+1} C^{n+p}_{p} = \left(\sum_{p=0}^{m} C^{n+p}_{p}\right) + C^{n+m+1}_{m+1}$$
$$= C^{n+m+1}_{m} + C^{n+m+1}_{m+1} \qquad \text{(because it is presumably true for } m\text{)}$$
$$= C^{n+m+2}_{m+1} \qquad \text{(Pascal identity)}$$
$$= C^{n+(m+1)+1}_{m+1}$$

So, it is true for $m + 1$ and hence by mathematical induction it is true in general.

(l) Let us have a collection of 2n balls: n of which are black and n of which are white. Now, the number
of all subsets of size 2 balls in this 2n-ball collection is given by $C^{2n}_{2}$ (which is the left hand side of this
identity). The number of all subsets of size 2 balls is also given by the sum of all 2-black combinations
(which is $C^{n}_{2}$), all 2-white combinations (which is $C^{n}_{2}$), and all black-white combinations (which is $n^{2}$),
and hence this sum is given by $2C^{n}_{2} + n^{2}$ (which is the right hand side of this identity). Accordingly,
$C^{2n}_{2} = 2C^{n}_{2} + n^{2}$.

(m) Let us have a collection of 3n balls: n of which are black, n of which are white, and n of which are
red. Now, the number of all subsets of size 2 balls in this 3n-ball collection is given by $C^{3n}_{2}$ (which is
the left hand side of this identity). The number of all subsets of size 2 balls is also given by the sum of
all 2-black combinations (which is $C^{n}_{2}$), all 2-white combinations (which is $C^{n}_{2}$), all 2-red combinations
(which is $C^{n}_{2}$), all black-white combinations (which is $n^{2}$), all black-red combinations (which is $n^{2}$),
and all white-red combinations (which is $n^{2}$), and hence this sum is given by $3C^{n}_{2} + 3n^{2}$ (which is the
right hand side of this identity). Accordingly, $C^{3n}_{2} = 3C^{n}_{2} + 3n^{2}$.

(n) We have (noting that n and m are integers with n > 0 and m > 0 and we use in the first line the
definition of the binomial coefficient for negative powers):

$$C^{-n}_{m} = \frac{(-n)(-n-1)(-n-2)\cdots(-n-m+2)(-n-m+1)}{m!} \qquad \text{(binomial coefficient)}$$
$$= (-1)^{m}\, \frac{n(n+1)(n+2)\cdots(n+m-2)(n+m-1)}{m!} \qquad \text{(taking out } m \text{ factors of } -1\text{)}$$
$$= (-1)^{m}\, \frac{(n+m-1)(n+m-2)\cdots(n+2)(n+1)n}{m!} \qquad \text{(reversing the order)}$$
$$= (-1)^{m}\, \frac{(n+m-1)(n+m-2)\cdots(n+2)(n+1)n \times (n-1)!}{m! \times (n-1)!}$$
$$= (-1)^{m}\, \frac{(n+m-1)!}{m!\,(n-1)!} \qquad \text{(Eq. 23)}$$
$$= (-1)^{m}\, C^{n+m-1}_{m} \qquad \text{(Eq. 25)}$$
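(o) We have:
$$\sum_{r=0}^{n-1} C^{k+r}_{k} = C^{k}_{k} + C^{k+1}_{k} + C^{k+2}_{k} + \cdots + C^{k+n-1}_{k}$$
$$= 1 + C^{k+1}_{k} + C^{k+2}_{k} + \cdots + C^{k+n-1}_{k} \qquad (C^{k}_{k} = 1)$$
$$= C^{k+1}_{k+1} + C^{k+1}_{k} + C^{k+2}_{k} + \cdots + C^{k+n-1}_{k} \qquad (C^{k+1}_{k+1} = 1)$$
$$= C^{k+2}_{k+1} + C^{k+2}_{k} + \cdots + C^{k+n-1}_{k} \qquad \text{(Pascal identity)}$$
$$= C^{k+3}_{k+1} + \cdots + C^{k+n-1}_{k} \qquad \text{(Pascal identity)}$$
$$\vdots$$
$$= C^{k+n-1}_{k+1} + C^{k+n-1}_{k}$$
$$= C^{k+n}_{k+1} \qquad \text{(Pascal identity)}$$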


PE: Do the following: 

(a) Mark the Pascal triangle (see Figures 2 and 3) in a way that indicates the Pascal identity and 
demonstrates its validity. 

(b) Mark the Pascal triangle (see Figures 2 and 3) in a way that indicates the hockey-stick identity 
and demonstrates its validity. 

(c) Prove the Vandermonde identity algebraically. 

(d) Prove the hockey-stick identity by induction. 

(e) Can we conclude from parts (l) and (m) that $C^{kn}_{2} = kC^{n}_{2} + C^{k}_{2}\, n^{2}$ $(k = 2, 3, \ldots)$?

26. Show that the total number of combinations of n objects taken 1, 2, ..., n at a time is $(2^{n} - 1)$.
Answer: From part (j) of Problem 25 we have $C^{n}_{0} + C^{n}_{1} + \cdots + C^{n}_{n} = 2^{n}$. Now, if we note that $C^{n}_{0} = 1$
then we get $C^{n}_{1} + \cdots + C^{n}_{n} = 2^{n} - 1$, i.e. the total number of combinations of n objects taken 1, 2, ..., n
at a time is $(2^{n} - 1)$ as required.
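Note: the result of this Problem can be illustrated by a brief C++ sketch that adds up the numbers of combinations directly. The names `combinations` and `total_combinations` are illustrative assumptions, and the sketch is meant for small n only (it uses 64-bit unsigned integers).

```cpp
#include <cstdint>

// Number of combinations of n objects taken m at a time (zero if m > n).
std::uint64_t combinations(unsigned n, unsigned m)
{
    if (m > n) return 0;
    if (m > n - m) m = n - m;
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= m; ++i)
        result = result * (n - m + i) / i;   // exact at every step
    return result;
}

// Total number of combinations of n objects taken 1, 2, ..., n at a time.
std::uint64_t total_combinations(unsigned n)
{
    std::uint64_t total = 0;
    for (unsigned m = 1; m <= n; ++m)
        total += combinations(n, m);
    return total;
}
```

The value returned by `total_combinations(n)` can then be compared with $2^{n} - 1$, e.g. with `(std::uint64_t(1) << n) - 1` in C++.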

PE: Verify this result computationally (using a programming language or a spreadsheet) for the cases 
n=1,2,...,10. 

27. We have 16 (numbered) balls and three bags where the capacities of these bags are 3, 5 and 8 balls.
How many possibilities do we have for filling the three bags with the 16 balls?

Answer: In a sense, this is a combination problem (since the balls in each bag are not ordered sets).
Accordingly:
• We can fill the 3-ball bag in $C^{16}_{3}$ ways.
• We can fill the 5-ball bag in $C^{13}_{5}$ ways (noting that 3 balls are already put in the 3-ball bag and
hence we have only 13 balls for filling the 5-ball bag).
• We can fill the 8-ball bag in $C^{8}_{8}$ ways (noting that only 8 balls are available for this filling).
Thus, by the fundamental principle of counting the number of possibilities for filling the three bags is:

$$C^{16}_{3} \times C^{13}_{5} \times C^{8}_{8} = \frac{16!}{3!\,13!} \times \frac{13!}{5!\,8!} \times \frac{8!}{8!\,0!} = \frac{16!}{3!\,5!\,8!} = 720720$$
We may also argue differently as follows: we have 16! permutations for filling the bags (i.e. 16 × 15 × 14
for filling the 3-ball bag, 13 × 12 × ··· × 9 for filling the 5-ball bag, and 8 × 7 × ··· × 1 for filling the
8-ball bag). However, because the order in each bag is irrelevant, then we divide 16 × 15 × 14 by 3!,
divide 13 × 12 × ··· × 9 by 5!, and divide 8 × 7 × ··· × 1 by 8!, and hence we get the same result. This
argument can be presented formally as follows:

$$\frac{P^{16}_{3}}{3!} \times \frac{P^{13}_{5}}{5!} \times \frac{P^{8}_{8}}{8!} = \frac{16 \times 15 \times 14}{3!} \times \frac{13 \times 12 \times \cdots \times 9}{5!} \times \frac{8 \times 7 \times \cdots \times 1}{8!} = \frac{16!}{3!\,5!\,8!}$$
Also see Problem 30. 
Note: it may be suspected that the order of filling the bags may affect the number of possibilities.
However, this is not the case (partly because of what we got in part a of Problem 25, i.e. $C^{n}_{n-m} = C^{n}_{m}$).
For example, if we reverse the order then we get:

$$C^{16}_{8} \times C^{8}_{5} \times C^{3}_{3} = \frac{16!}{8!\,8!} \times \frac{8!}{5!\,3!} \times \frac{3!}{3!\,0!} = \frac{16!}{8!\,5!\,3!} = 720720$$
PE: Repeat the Problem for 29 cards to fill 4 bags of capacities: 9, 6, 4 and 10. 
28. Show that the number of possibilities for filling k slots with n objects where the slots have capacities
$n_1, n_2, \ldots, n_k$ (with $\sum_{i=1}^{k} n_i = n$) is:

$$C^{n}_{n_1,n_2,\ldots,n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}$$
Answer: If we follow the method of Problem 27 then we may argue as follows:
• We can fill the $n_1$-object slot in $C^{n}_{n_1}$ ways.
• We can fill the $n_2$-object slot in $C^{n-n_1}_{n_2}$ ways (noting that $n_1$ objects are already put in the $n_1$-object
slot and hence we have only $n - n_1$ objects for filling the $n_2$-object slot).
• We can fill the $n_k$-object slot in $C^{n-(n_1+\cdots+n_{k-1})}_{n_k} = C^{n_k}_{n_k}$ ways.
Thus, by the fundamental principle of counting the number of possibilities for filling the k slots is:

$$C^{n}_{n_1} \times C^{n-n_1}_{n_2} \times \cdots \times C^{n-(n_1+\cdots+n_{k-1})}_{n_k} = \frac{n!}{n_1!\,(n-n_1)!} \times \frac{(n-n_1)!}{n_2!\,(n-n_1-n_2)!} \times \cdots \times \frac{(n-[n_1+\cdots+n_{k-1}])!}{n_k!\,0!} = \frac{n!}{n_1!\, n_2! \cdots n_k!}$$


We may also argue like our second argument in Problem 27 and we get the same result (also see 
Problem 30). 
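Note: the slot-by-slot argument of this answer translates directly into a short C++ sketch. The names `combinations` and `multinomial` are hypothetical, and the total number of objects n is taken to be the sum of the given capacities (as in the condition of this Problem).

```cpp
#include <cstdint>
#include <vector>

// Number of combinations of n objects taken m at a time (zero if m > n).
std::uint64_t combinations(unsigned n, unsigned m)
{
    if (m > n) return 0;
    if (m > n - m) m = n - m;
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= m; ++i)
        result = result * (n - m + i) / i;   // exact at every step
    return result;
}

// Number of ways of filling slots of the given capacities with
// n = (sum of capacities) distinct objects, filling one slot after another.
std::uint64_t multinomial(const std::vector<unsigned>& capacities)
{
    unsigned remaining = 0;
    for (unsigned c : capacities)
        remaining += c;                       // n = n_1 + n_2 + ... + n_k
    std::uint64_t ways = 1;
    for (unsigned c : capacities) {
        ways *= combinations(remaining, c);   // choose objects for this slot
        remaining -= c;                       // these objects are now used
    }
    return ways;
}
```

For example, `multinomial({3, 5, 8})` and `multinomial({8, 5, 3})` both reproduce the number of possibilities of Problem 27, which illustrates that the order of filling the slots does not matter.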
Note: $C^{n}_{n_1,n_2,\ldots,n_k}$ is a generalization of the binomial coefficient and hence it is called the multinomial
coefficient (noting that the binomial coefficient corresponds to two slots one of capacity m and one of
capacity n − m). As a generalization to the binomial theorem (see Problem 20), we have the following
multinomial theorem:

$$(x_1 + x_2 + \cdots + x_k)^{n} = \sum C^{n}_{n_1,n_2,\ldots,n_k}\, x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k} \qquad (34)$$

where the sum is taken over all $n_1 + n_2 + \cdots + n_k = n$ (for non-negative $n_1, n_2, \ldots, n_k$). Also see
Problems 29 and 30.

PE: Do the following: 

(a) Give a number of examples of specific “objects” and “slots” to which this type of “combination” 
applies (noting that “objects” and “slots” are generic labels that can apply to many things and concepts). 
(b) Calculate $C^{18}_{3,4,5,6}$ and $C^{9}_{4,1,1,3}$.

(c) Expand $(x + y + z)^{2}$ algebraically and hence verify the multinomial theorem for this case.

29. Show that the binomial coefficient is a special case of the multinomial coefficient.

Answer: For $k = 2$, the multinomial coefficient is (noting that $n_1 + n_2 = n$):

$$C^{n}_{n_1,n_2} = \frac{n!}{n_1!\, n_2!} = \frac{n!}{n_1!\,(n-n_1)!} = C^{n}_{n_1} = C^{n}_{n_2}$$
PE: Noting that $C^{n}_{n_1} = \frac{n!}{n_1!\, n_2!} = C^{n}_{n_1,n_2}$, what can you conclude? Make a comment.
30. Show that $C^{n}_{n_1,n_2,\ldots,n_k}$ represents the number of permutations of n objects such that: $n_1$ of which are
identical (or indistinguishable), $n_2$ are identical, ..., and $n_k$ are identical (with $\sum_{i=1}^{k} n_i = n$).

Answer: We may argue that being identical is like being stored in a slot where the objects in that slot
are not distinguished from each other by order or anything else, and hence we can use the rationale
and derivation of Problem 28. We may also argue (more appropriately for this purpose) that if the n
objects are different then the number of their permutations (i.e. $P^{n}_{n}$) is n!. However, because $n_1$ of
these n objects are identical then the permutations of the $n_1$ objects (whose number is $n_1!$) within the
n-object permutations are identical and hence we should divide n! by $n_1!$ to get the number of distinct
permutations of the n objects. This similarly applies to the $n_2, \ldots, n_k$ identical objects within the n
objects. Therefore, the number of distinct permutations of such n objects is:

$$C^{n}_{n_1,n_2,\ldots,n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}$$

Accordingly, $C^{n}_{n_1,n_2,\ldots,n_k}$ represents the number of permutations with repetitions (i.e. repetition
of $n_1$ objects, repetition of $n_2$ objects, ..., repetition of $n_k$ objects).


It is important to note that if the n objects consist of some repetitive (or identical) objects and some
non-repetitive objects then each non-repetitive object contributes a factor of 1! in the denominator and
hence these factors can be ignored. For example, if we have $n_1$ identical objects and $n_2$ other identical
objects as well as 3 distinct objects (and hence $n = n_1 + n_2 + 3$) then from the formula of $C^{n}_{n_1,n_2,\ldots,n_k}$
we get:

$$C^{n}_{n_1,n_2,n_3,n_4,n_5} = C^{n}_{n_1,n_2,1,1,1} = \frac{n!}{n_1!\, n_2!\, 1!\, 1!\, 1!} = \frac{n!}{n_1!\, n_2!} = C^{n}_{n_1,n_2}$$

So, based on this understanding $C^{n}_{n_1,n_2,\ldots,n_k}$ can apply even for $\sum_{i=1}^{k} n_i < n$. In fact, we can even
consider $P^{n}_{n} = n!$ (for n distinct objects) as a special case of $C^{n}_{n_1,n_2,\ldots,n_k}$ because when we have n
distinct objects then $n_1 = n_2 = \cdots = n_k = 1$ and hence:

$$C^{n}_{n_1,n_2,\ldots,n_k} = C^{n}_{1,1,\ldots,1} = \frac{n!}{1!\, 1! \cdots 1!} = n! = P^{n}_{n}$$

Accordingly, we can consider $C^{n}_{n_1,n_2,\ldots,n_k}$ as the more general symbol (and mathematical entity) for
representing the numbers of combinations and permutations involving and considering n objects.

We also note that $C^{n}_{n_1,n_2,\ldots,n_k}$ (which represents the number of permutations with repetitions regardless
of the condition $n_1 + n_2 + \ldots + n_k = n$, i.e. possibly $\sum_{i=1}^{k} n_i < n$) reduces to the multinomial
coefficient (see Problem 28) when $n_1 + n_2 + \ldots + n_k = n$.

Note: the reader should be aware that the term “permutations with repetitions” (and the like) may
be used differently where “repetition” means the repetition of the individual objects in selections of
multiple objects (rather than repetition in the n objects of the set from which the selections are made),
and hence the number of “permutations with repetitions” (of n objects taken k at a time) in this sense
is $n^{k}$ (see Eq. 30) as opposed to the number of “permutations with no repetitions” which is
given by $P^{n}_{k} = n!/(n-k)!$ according to Eq. 29 (and as compared to the number of “permutations with
repetitions” in the above sense which is given by $C^{n}_{n_1,n_2,\ldots,n_k}$). So in brief, the number of “permutations
with repetitions in the objects of the selection set which is made of n objects” is $C^{n}_{n_1,n_2,\ldots,n_k}$, and the
number of “permutations with repetitions in the objects of the selected set which is made of k objects”
is $n^{k}$.[39] In fact, it is better to adopt two different terms for these two types of “permutations with
repetitions” (e.g. one is called “permutations with repetitions” and the other is called “permutations
with duplication”). However, we do not want to go against the literature in this issue. Also see Problem
5.

PE: Calculate the following [noting that if $n > (n_1 + \cdots + n_k)$ then the remaining $n - (n_1 + \cdots + n_k)$


[39] It should also be noted that in $C^{n}_{n_1,n_2,\ldots,n_k}$ the entire set of n objects is taken in the permutation (and hence we have
“n-permutations” out of n objects), while in the $n^{k}$ count a set of size k objects (out of a set of size n objects) is taken in the
permutation (and hence we have “k-permutations” out of n objects).
are distinct or non-repetitive] : 

(a) Cis: (b) C5. (c) C3. (d) C36: 
31. Find the number of distinct strings that can be formed from the letters of the word (a) “attachment”
(b) “subsequently” (c) “incomprehensibility”.

Answer: Referring to Problem 30:

(a) The word “attachment” contains 10 letters where “a” is repeated 2 times and “t” is repeated 3
times. Therefore, the number of distinct strings that can be formed from the letters of “attachment”
is $C^{10}_{2,3} = 302400$.

(b) Following the method of part (a) we have $C^{12}_{2,2,2} = 59875200$.

(c) Following the method of part (a) we have $C^{19}_{4,2,2} = 1267136462592000$.
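Note: the counts of this Problem can be obtained for any word by a small C++ sketch that tallies the letters and then assigns positions to each letter in turn. The names `combinations` and `distinct_strings` are our own illustrative choices, and the sketch assumes that the result fits in a 64-bit unsigned integer (which holds for the three words above).

```cpp
#include <array>
#include <cstdint>
#include <string>

// Number of combinations of n objects taken m at a time (zero if m > n).
std::uint64_t combinations(unsigned n, unsigned m)
{
    if (m > n) return 0;
    if (m > n - m) m = n - m;
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= m; ++i)
        result = result * (n - m + i) / i;   // exact at every step
    return result;
}

// Number of distinct strings formed from all the letters of `word`,
// i.e. the number of permutations with repeated (identical) letters.
std::uint64_t distinct_strings(const std::string& word)
{
    std::array<unsigned, 256> count{};        // occurrences of each character
    for (unsigned char ch : word)
        ++count[ch];
    unsigned remaining = static_cast<unsigned>(word.size());
    std::uint64_t total = 1;
    for (unsigned c : count) {
        if (c == 0) continue;
        total *= combinations(remaining, c);  // choose positions for this letter
        remaining -= c;
    }
    return total;
}
```

As an example of use, `distinct_strings("attachment")` should return the value obtained in part (a). Letters that occur once contribute a factor equal to the number of remaining positions, which is consistent with ignoring the 1! factors in Problem 30.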

PE: Repeat the Problem for the words (a) “Sussex” (b) “explicit” (c) “indistinguishable”. 

32. Let us have 7 drawers and 5 shirts and we want to store these shirts in (5 of) these drawers restricted
by the condition that no more than 1 shirt can be stored in any drawer. How many ways do we have for
storing these shirts?

Answer: We have $C^{7}_{5}$ ways for selecting 5 drawers (out of 7) to be used for storage, and we have 5!
ways for distributing the 5 shirts among the selected 5 drawers (i.e. the first selected drawer can be used
for storing any one of the 5 shirts, the second selected drawer can be used for storing any one of the
remaining 4 shirts, and so on). Hence we have:

$$C^{7}_{5} \times 5! = \frac{7!}{5!\,(7-5)!} \times 5! = \frac{7!}{(7-5)!} = P^{7}_{5} = 2520$$

ways for storing these shirts.
We may also argue (differently) that we have 7 drawers to be assigned to 5 shirts where we have 7
ways for assigning a drawer to the first shirt, 6 ways for assigning a drawer to the second shirt, ..., and
3 ways for assigning a drawer to the fifth shirt and hence we have:

$$7 \times 6 \times 5 \times 4 \times 3 = \frac{7!}{(7-5)!} = P^{7}_{5} = 2520$$

ways for storing these shirts. Also see Problem 37 of § 3.2.
PE: Repeat this Problem assuming we have 6 cars to be stored in 9 garages where each garage can
accommodate no more than 1 car.
33. Let us have 7 shirts and we want to store 5 of them in a drawer and the remaining 2 in a second drawer.
In how many ways can this be done?
Answer: This is obviously a combination problem because we want to select 5 (non-ordered) shirts
out of 7 shirts for storage in the first drawer (noting that the remaining 2 shirts will inevitably go to
the second drawer). Hence, we have $C^{7}_{5} = 21$ ways. We may also argue (differently) that we want to
select 2 (non-ordered) shirts out of 7 shirts for storage in the second drawer (noting that the remaining
5 shirts will inevitably go to the first drawer), and hence, we have $C^{7}_{2} = 21$ ways as before (noting that
$C^{7}_{2} = C^{7}_{5}$; see part a of Problem 25).
PE: Solve this Problem as a permutation with repetition problem (see Problem 30). 
34. In statistical mechanics we have three main types of statistics (or distribution):
(a) Maxwell-Boltzmann statistics for (classical) distinguishable particles. 
(b) Fermi-Dirac statistics for (quantum) indistinguishable particles that cannot occupy the same state 
(e.g. electrons). 
(c) Bose-Einstein statistics for (quantum) indistinguishable particles that can occupy the same state 
(e.g. photons). 
If we have n particles and k available states (where these n particles are supposed to occupy some or 
all of these k states), find the number of possibilities of occupancy for each one of these statistics. 
Answer: 
(a) For Maxwell-Boltzmann statistics each one of the n particles can occupy any one of the k states
and hence we have $k^{n}$ possibilities (i.e. a product of n factors each of which is k).[40] This is because the assignment of each
particle to any state is independent of the assignment of the other particles. For example, if we have 5
particles and 3 states then the first particle can occupy state 1 or 2 or 3, ..., and the fifth particle
can occupy state 1 or 2 or 3 and hence we have $3^{5}$ possibilities for occupancy.

(b) For Fermi-Dirac statistics we have $C^{k}_{n}$ possibilities for occupancy (with the restriction that $k \geq n$
because each available state cannot be occupied by more than one particle and hence the available
states must be at least as many as the number of particles). This is because we have $C^{k}_{n}$ ways for
selecting n states (out of k states) to be assigned to the n particles noting that the particles are
indistinguishable.[41]
(c) For Bose-Einstein statistics we have $C^{n+k-1}_{n}$ possibilities for occupancy. To explain and justify
this let us use a simple example where we have 5 balls to be stored in this way (i.e. by the Bose-Einstein 
statistics) on a shelf that is divided to 8 compartments by 7 (i.e. 8 — 1) interior (movable) separators 
(and contained between two walls). There are many distinguishable configurations (or arrangements) 
for the occupancy of these balls in the 8 compartments. To demonstrate this, we present in Figure 4 
four of these arrangements. 

As we see, the balls (on one hand) and the separators (on the other hand) are not distinguished in this 
type of storage (because all we are interested in is the number of balls in each particular compartment), 
and hence each one of these arrangements is distinguished by the positions of the separators between 
the balls (or similarly by the positions of the balls between the separators). In other words, we are 
interested in the combination of the 8 − 1 = 7 positions of separators in 5 + 7 positions of “balls and
separators” (i.e. selecting 7 positions out of 12 positions).[42] So in brief, we are essentially selecting
7 distinct positions (i.e. the positions of the separators) out of the 5 + 7 distinct positions (i.e. the
positions of the “balls and separators”) noting that the selection of these positions is not subject to
order and hence it is like selecting 7 distinct balls (at once) out of 12 distinct balls. Accordingly, the
number of possible arrangements is $C^{5+8-1}_{8-1} = C^{5+8-1}_{5}$ (see part a of Problem 25).

Now, if the n particles in our Problem correspond to the 5 balls in this example, and the k states in our
Problem correspond to the 8 compartments in this example (noting that the separators, whose number
is the number of compartments minus 1, correspond to k − 1 separators between the k states), then we can conclude from this
example (by generalizing its pattern) that the number of possible configurations for occupancy in the
Bose-Einstein statistics is $C^{n+k-1}_{k-1} = C^{n+k-1}_{n}$. Also see Problem 4 of § 7.2.
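Note: the three counting formulae of this answer can be collected in a compact C++ sketch. The function names (`maxwell_boltzmann`, `fermi_dirac`, `bose_einstein` and the helper `combinations`) are hypothetical labels introduced here for illustration, with n being the number of particles and k the number of available states.

```cpp
#include <cstdint>

// Number of combinations of n objects taken m at a time (zero if m > n).
std::uint64_t combinations(unsigned n, unsigned m)
{
    if (m > n) return 0;
    if (m > n - m) m = n - m;
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= m; ++i)
        result = result * (n - m + i) / i;   // exact at every step
    return result;
}

// Distinguishable particles, no restriction on occupancy: k^n.
std::uint64_t maxwell_boltzmann(unsigned n, unsigned k)
{
    std::uint64_t count = 1;
    for (unsigned i = 0; i < n; ++i)
        count *= k;                           // each particle picks one of k states
    return count;
}

// Indistinguishable particles, at most one particle per state:
// select n of the k states (zero possibilities if k < n).
std::uint64_t fermi_dirac(unsigned n, unsigned k)
{
    return combinations(k, n);
}

// Indistinguishable particles, no restriction on occupancy:
// select the n ball positions among n + k - 1 "balls and separators".
std::uint64_t bose_einstein(unsigned n, unsigned k)
{
    if (k == 0) return n == 0 ? 1 : 0;        // no states: only the empty case
    return combinations(n + k - 1, n);
}
```

For instance, `maxwell_boltzmann(5, 3)` corresponds to the example of part (a) and `bose_einstein(5, 8)` corresponds to the example of 5 balls in 8 compartments of part (c).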

Note: the above three types of distribution represent the main (i.e. famous and physically significant) 
distributions but they obviously do not exhaust all the possibilities. For example, we can imagine 
a statistics of distinguishable particles (as in Maxwell-Boltzmann statistics) that cannot occupy the 
same state (as in Fermi-Dirac statistics).[43] In fact, there are other types of statistics (some can be
found in the literature and some can be proposed and synthesized). 

PE: Give several examples from daily life for each one of the above distributions.[44] Also consider and


[40] In other words, each one of the n particles can assume any one of the k states and hence by the fundamental principle
of counting the number of possibilities is:

$$\underbrace{k \times k \times \cdots \times k}_{n \text{ factors}} = k^{n}$$


[41] In fact, the situation here is similar in part to the situation in Problem 32 (noting that the particles here, unlike the
shirts of Problem 32, are indistinguishable and hence the 5! factor of Problem 32 does not arise here). 

[42] We may equally say: we are interested in the combination of the 5 positions of balls in 5 + 7 positions of “balls and
separators” (i.e. selecting 5 positions out of 12 positions). We may also consider this as a permutation with repetition
problem (see Problem 30) where we have 12 objects 5 of which are identical and 7 of which are identical and hence the
number of their distinguishable permutations is $C^{12}_{5,7} = \frac{12!}{5!\,7!}$.

[43] The number of possible configurations for occupancy (of n particles and k available states where $k \geq n$) in such statistics
is $P^{k}_{n}$. This is because we can select any one of the k available states for the occupancy of the 1st particle, any one of the
remaining k − 1 states for the occupancy of the 2nd particle, ..., and any one of the remaining k − n + 1 states for the
occupancy of the nth particle, and hence by the fundamental principle of counting we have $k \times (k-1) \times \cdots \times (k-n+1) = P^{k}_{n}$
possible configurations.

[44] For example, storing 10 distinct balls (e.g. by having different colors or size or weight) in 3 large boxes (where each 
box can accommodate all the 10 balls) is similar to the Maxwell-Boltzmann statistics, storing 6 identical balls in 9 small 




Figure 4: Four specific arrangements of the possible $C^{5+8-1}_{8-1} = C^{5+8-1}_{5} = 792$ configurations of the occupancy
of 5 balls in 8 compartments according to the Bose-Einstein statistics. We note that the black circles
represent balls, w stands for wall and s for separator. See Problem 34 of § 2.2.


investigate other types of distribution, e.g. distributions similar to the Maxwell-Boltzmann statistics 
or the Bose-Einstein statistics but (some or all of) the available states have specific capacities. 

35. Referring to Problem 34, let have 3 particles and 4 states. Find all the possibilities of occupancy 
according to: 
(a) Maxwell-Boltzmann statistics. (b) Fermi-Dirac statistics. (c) Bose-Einstein statistics. 
Answer: Let us use 3 English letters (i.e. different or identical depending on the case) to represent
the 3 particles and use 3 separators to represent the 4 states (ignoring the walls on the two sides).
(a) In the Maxwell-Boltzmann statistics the particles are distinguishable and therefore we use 3 different
letters (a, b and c) to represent the particles. Moreover, there is no restriction on the number
of particles that can occupy any one of the states. In this statistics we have $k^{n} = 4^{3} = 64$ possibilities
which are:

abc| | |    |abc| |    | |abc|    | | |abc    a|bc| |    a| |bc|    a| | |bc    b|ac| |
b| |ac|    b| | |ac    c|ab| |    c| |ab|    c| | |ab    bc|a| |    |a|bc|    |a| |bc
ac|b| |    |b|ac|    |b| |ac    ab|c| |    |c|ab|    |c| |ab    bc| |a|    |bc|a|
| |a|bc    ac| |b|    |ac|b|    | |b|ac    ab| |c|    |ab|c|    | |c|ab    bc| | |a
|bc| |a    | |bc|a    ac| | |b    |ac| |b    | |ac|b    ab| | |c    |ab| |c    | |ab|c
a|b|c|    a|c|b|    a|b| |c    a|c| |b    a| |b|c    a| |c|b    b|a|c|    b|c|a|
b|a| |c    b|c| |a    b| |a|c    b| |c|a    c|a|b|    c|b|a|    c|a| |b    c|b| |a
c| |a|b    c| |b|a    |a|b|c    |a|c|b    |b|a|c    |b|c|a    |c|a|b    |c|b|a


(b) In the Fermi-Dirac statistics the particles are indistinguishable and therefore we use a single letter
(i.e. a) to represent the particles. Moreover, we have the restriction that no more than one particle
can occupy any state. In this statistics we have $C^{k}_{n} = C^{4}_{3} = 4$ possibilities which are:

a|a|a|    a|a| |a    a| |a|a    |a|a|a


boxes (where each box can accommodate only one ball) is similar to the Fermi-Dirac statistics, and storing 10 identical 
balls in 3 large boxes (where each box can accommodate all the 10 balls) is similar to the Bose-Einstein statistics. 

(c) In the Bose-Einstein statistics the particles are indistinguishable and therefore we use a single letter
(i.e. a) to represent the particles. Moreover, there is no restriction on the number of particles that can
occupy any one of the states. In this statistics we have $C^{n+k-1}_{n} = C^{6}_{3} = 20$ possibilities which are:

aaa| | |    |aaa| |    | |aaa|    | | |aaa    a|aa| |
a| |aa|    a| | |aa    aa|a| |    |a|aa|    |a| |aa
aa| |a|    |aa|a|    | |a|aa    aa| | |a    |aa| |a
| |aa|a    a|a|a|    a|a| |a    a| |a|a    |a|a|a


PE: Repeat this Problem for 3 particles and 3 states. 
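
The counts obtained in this Problem can be cross-checked by brute force. The following C++ sketch is not one of the codes that accompany the book; the names `OccupancyCounts` and `countOccupancies` are illustrative assumptions. It runs through all $k^n$ assignments of n labelled particles to k states and, for the indistinguishable cases, keeps only the distinct occupation-number patterns:

```cpp
#include <set>
#include <vector>

// Number of occupancy possibilities for n particles in k states.
struct OccupancyCounts {
    long maxwellBoltzmann = 0;  // distinguishable, no restriction
    long exclusive = 0;         // distinguishable, at most one particle per state
    long boseEinstein = 0;      // indistinguishable, no restriction
    long fermiDirac = 0;        // indistinguishable, at most one particle per state
};

OccupancyCounts countOccupancies(int n, int k) {
    OccupancyCounts counts;
    std::set<std::vector<int>> patterns;        // distinct occupation numbers
    std::set<std::vector<int>> singlePatterns;  // same, with single occupancy only
    std::vector<int> state(n, 0);               // state[i] = state of particle i

    while (true) {
        // Occupation numbers of the current assignment.
        std::vector<int> occupation(k, 0);
        for (int s : state) ++occupation[s];
        bool single = true;
        for (int o : occupation) {
            if (o > 1) single = false;
        }

        ++counts.maxwellBoltzmann;
        patterns.insert(occupation);
        if (single) {
            ++counts.exclusive;
            singlePatterns.insert(occupation);
        }

        // Advance to the next assignment (odometer-style counting in base k).
        int i = 0;
        while (i < n && ++state[i] == k) {
            state[i] = 0;
            ++i;
        }
        if (i == n) break;  // every assignment has been visited
    }

    counts.boseEinstein = static_cast<long>(patterns.size());
    counts.fermiDirac = static_cast<long>(singlePatterns.size());
    return counts;
}
```

Calling `countOccupancies(3, 4)` from a `main` function should reproduce the 64, 4 and 20 possibilities of parts (a), (b) and (c). The enumeration grows like $k^n$, so this sketch is only meant for small cases such as the present one.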

36. Repeat Problem 35 assuming a statistics of distinguishable particles that cannot occupy the same
state.

Answer: In this statistics the particles are distinguishable and therefore we use 3 different letters (a,
b and c) to represent the particles. Moreover, we have the restriction that no more than one particle
can occupy any state. As explained earlier (see footnote [43]), we have $P^k_n = P^4_3 = 24$ possibilities
which are:


a|b|c|    a|c|b|    a|b| |c    a|c| |b    a| |b|c    a| |c|b    b|a|c|    b|c|a|
b|a| |c    b|c| |a    b| |a|c    b| |c|a    c|a|b|    c|b|a|    c|a| |b    c|b| |a
c| |a|b    c| |b|a    |a|b|c    |a|c|b    |b|a|c    |b|c|a    |c|a|b    |c|b|a


Note: some readers may have noticed that the answer of this Problem is identical to the last three 
rows of the answer of part (a) of Problem 35. This should be no surprise since the answer of the present 
Problem can be obtained from the answer of part (a) of Problem 35 by excluding all the possibilities 
of part (a) of Problem 35 that have multiple occupancy to any single state. 

PE: Repeat Problem 35 assuming a statistics of indistinguishable particles where the capacity of state 
1 is one particle (i.e. it cannot accommodate more than one particle), the capacity of state 2 is two 
particles, the capacity of state 3 is three particles, and the capacity of state 4 is four particles. 

37. A digital byte is an ordered arrangement of 8 bits where each bit can be either 0 or 1. How many
different bytes contain exactly four 1's?

Answer: If we consider the positions of the bits within the byte as states and consider the four 1's as
indistinguishable particles (noting that no one of these 1's can occupy more than one position) then
this is like a Fermi-Dirac statistics and hence we have $C^8_4 = 70$ different bytes containing exactly four
1's.

We may also argue (differently and more simply) that we can assign the first 1 to any of the 8 positions,
the second 1 to any of the remaining 7 positions, the third 1 to any of the remaining 6 positions, and
the fourth 1 to any of the remaining 5 positions, and hence we have $8 \times 7 \times 6 \times 5 = P^8_4$ possibilities.
However, because the 1's are indistinguishable we should divide this by 4! (i.e. the number of their
permutations) and hence we should have $P^8_4/4! = C^8_4 = 70$. In fact, this can be understood (more
simply) as choosing 4 positions (out of the 8 available positions) for the assignment of the "1" bits.
PE: Repeat the Problem for the different bytes that contain exactly two 1’s and list all these bytes. 
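
Since there are only 256 bytes, this count can also be confirmed by direct inspection. The following C++ sketch (the function name `countBytesWithOnes` is an illustrative assumption, not one of the book's codes) examines every 8-bit value and counts those with the required number of 1's:

```cpp
#include <bitset>

// Number of 8-bit bytes that contain exactly `ones` bits equal to 1.
int countBytesWithOnes(int ones) {
    int count = 0;
    for (unsigned value = 0; value < 256; ++value) {
        // std::bitset<8>::count() returns the number of set bits.
        if (static_cast<int>(std::bitset<8>(value).count()) == ones) {
            ++count;
        }
    }
    return count;
}
```

For example, `countBytesWithOnes(4)` should agree with $C^8_4 = 70$.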
38. Referring to Problem 1 of § 1.5.4, find the number of possibilities for the random selection of 5 balls
out of 10 in each one of the 4 cases considered in that Problem.

Answer: 

(a) For sampling with order and without replacement we have $P^{10}_5 = 30240$ possibilities (because the
1st drawn ball can be any one of the 10 balls, the 2nd drawn ball can be any one of the remaining 9
balls, ..., and the 5th drawn ball can be any one of the remaining 6 balls and hence by the fundamental
principle of counting we have $10 \times 9 \times 8 \times 7 \times 6 = P^{10}_5$ possibilities). In fact, this is an application of
Eq. 29.

(b) For sampling with order and with replacement we have $10^5$ possibilities (because each one of the
5 drawn balls can be any one of the 10 balls and hence by the fundamental principle of counting we
have $10 \times 10 \times 10 \times 10 \times 10 = 10^5$ possibilities). In fact, this is an application of Eq. 30.

(c) For sampling without order and without replacement we have $C^{10}_5 = 252$ possibilities (because this
is the same as the number of distinct sets of size 5 in a set of size 10). In fact, this is an application
of Eq. 31.

(d) For sampling without order and with replacement we have $C_{10,5} = 2002$ possibilities. In fact, this
is an application of Eq. 32.

PE: Considering the answer of this Problem, can we make the following generalization: 


$$N_{nn} \le N_{nr} \le N_{on} \le N_{or}$$


where $N_{nn}$ is the number of possibilities without order and without replacement, $N_{nr}$ is the number
of possibilities without order and with replacement, $N_{on}$ is the number of possibilities with order and
without replacement, and $N_{or}$ is the number of possibilities with order and with replacement? Justify
your answer.

39. Referring to Problem 2 of § 1.5.4, find the number of possibilities for the random selection of 5 balls
out of 10 in each one of the 2 cases considered in that Problem (i.e. with and without replacement).
Answer: As the balls are indistinguishable we have only one possibility in each one of the 2 cases. 
PE: Find the number of possibilities for the random selection of 5 balls out of 10 identical balls if 5 
of the 10 balls are black and the other 5 are white. Assume the selection is (a) with replacement and 
(b) without replacement. Also consider order. 

40. We have 40 passengers and 5 buses each of 60-passenger capacity. In how many ways can these
passengers be distributed among these buses if:

(a) The assignment of passengers to buses is considered, i.e. it is important which passenger travels 
on which bus. 

(b) The assignment of passengers to buses is not considered, i.e. the only important thing is the 
number of passengers traveling on which of these buses. 

Answer: 

(a) We have 40 distinguishable passengers to be assigned to 5 distinct buses. Referring to part (a) of
Problem 34, this is like a Maxwell-Boltzmann statistics with n = 40 and k = 5 and hence we have
$k^n = 5^{40}$ ways.

(b) We have 40 indistinguishable passengers to be assigned to 5 distinct buses. Referring to part
(c) of Problem 34, this is like a Bose-Einstein statistics with n = 40 and k = 5 and hence we have
$C^{n+k-1}_n = C^{44}_{40} = 135751$ ways.

PE: Repeat the Problem assuming we have 40 passengers and 10 taxis each of 4-passenger capacity. 
41. In a circular permutation, n distinct objects (n > 3) are arranged in a circle (i.e. the order of the
objects in the circle is considered but with no consideration of start/end and hence nothing presumably
changes if we rotate the circle due to the circular symmetry). Find the number of circular permutations
of n objects arranged in a circle.

Answer: If the permutation is straight then we have n! permutations. So, let us bend this straight
permutation to make a circle. Accordingly, we still have n! permutations but with the condition that if
we rotate any one of these circular permutations by an angle of size $2\pi/n$ we get the same permutation.
Now, in a circle we have n angles of size $2\pi/n$. This means that we need to divide n! by n to get the
number of distinct circular permutations. Hence, the number of circular permutations of n objects is
$(n-1)!$.

We may also argue (in another way) that we can assign one object (say the first object) to any one of 
the n positions in the circle (noting that nothing changes if we rotate the circle and hence the assign- 
ment to any one of the n positions is indistinguishable from the assignment to any other position which 
means that we essentially have only one way for this assignment). The remaining (n — 1) objects can 
then be arranged in (n — 1)! distinct ways (noting their positions relative to the first object considering 
a specific sense of order, e.g. clockwise). Hence, by the fundamental principle of counting we must 
have 1 x (n — 1)! = (n — 1)! circular permutations. 

PE: Give some examples of circular permutation in real life. What is the number of circular permu- 
tations of 6 objects? What is the number if the 6 objects are identical? 

42. Give all the circular permutations of (a) the letters {a, b, c} (b) the letters {a, b, c, d} (c) the letters
{a, b, c, d, e}.
Answer:

(a) We have (3 — 1)! = 2 circular permutations which are: abc, acb. 

(b) We have (4 — 1)! = 6 circular permutations which are: abcd, abdc, acbd, acdb, adbc, adcb. 
(c) We have (5 — 1)! = 24 circular permutations which are: 


abcde  abced  abdce  abdec  abecd  abedc  acbde  acbed  acdbe  acdeb  acebd  acedb
adbce  adbec  adcbe  adceb  adebc  adecb  aebcd  aebdc  aecbd  aecdb  aedbc  aedcb


PE: Plot the above circular permutations on circles.[45]
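
The listings of this Problem follow the second argument of Problem 41: one object is held fixed and the remaining (n − 1) objects are permuted. As a minimal sketch of that idea (the function name `circularPermutations` is an illustrative assumption, and the letters are assumed to be distinct), the standard library routine `std::next_permutation` can generate the list:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Lists the circular permutations of the (distinct) characters in `letters`
// by keeping the first character fixed and permuting the others.
std::vector<std::string> circularPermutations(std::string letters) {
    std::vector<std::string> result;
    if (letters.empty()) return result;

    std::sort(letters.begin(), letters.end());
    // letters[0] stays in place; std::next_permutation steps through the
    // (n - 1)! orderings of the remaining characters in lexicographic order.
    do {
        result.push_back(letters);
    } while (std::next_permutation(letters.begin() + 1, letters.end()));
    return result;
}
```

For the letters "abcd" this should return the 6 strings listed in part (b), and for "abcde" the 24 strings of part (c).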


43. How many 5-letter strings can be made from the letters {s, d, i, e, f, j, o} if no vowel is allowed to be
at the beginning or the end and with no repetition?

Answer: We have 4 consonants and 3 vowels. Now, the first letter can be any one of the 4 consonants 
and the last letter can be any one of the remaining 3 consonants while the 3 middle letters can be filled 
with the 3-letter permutations of the remaining 5 letters. Hence, the number of 5-letter strings is: 


$$4 \times P^5_3 \times 3 = 4 \times \frac{5!}{2!} \times 3 = 720$$
PE: Repeat the Problem assuming repetition is allowed and the last letter can be a vowel. 
44. We have 7 numbered balls (4 black and 3 white) to be arranged in a row. In how many ways can this
be done if:
(a) The balls are grouped by color. (b) Only the black balls are grouped.


Answer: 

(a) In this case we have 2 main possibilities: black first and white first. Now, the blacks can be 
arranged in 4! ways and the whites can be arranged in 3! ways. Therefore, we have 2 x 4! x 3! = 288 
ways. 

(b) In this case we have 4 main possibilities: black first, black second, black third, and black last. 
Again, the blacks can be arranged in 4! ways and the whites can be arranged in 3! ways. Therefore, 
we have 4 x 4! x 3! = 576 ways. 

PE: Repeat the Problem (i.e. both parts) assuming: 

(a) The balls of each color should also be ordered according to their numbers increasingly. 

(b) The balls are not numbered (and hence they are indistinguishable except by color). 

45. We have 9 numbered balls (4 black, 3 white and 2 red). In how many ways can they be arranged:


(a) In a row grouped by color. (b) In a circle grouped by color. 


Answer: 

(a) The 3 colors can be arranged in a row in 3! ways, the 4 blacks in 4! ways, the 3 whites in 3! ways 
and the 2 reds in 2! ways. Hence, there are 3! x 4! x 3! x 2! = 1728 ways. 

(b) The 3 colors can be arranged in a circle in 2! ways (see Problem 41), the 4 blacks in 4! ways, the 
3 whites in 3! ways and the 2 reds in 2! ways. Hence, there are 2! x 4! x 3! x 2! = 576 ways. 

PE: Repeat the Problem assuming the balls are not numbered and no grouping by color is imposed. 
46. Define and explain $C_k^{-n}$ where n is a positive integer and $k = 0, 1, \ldots, \infty$.

Answer: $C_k^{-n}$ is the "binomial coefficient" appearing in the binomial theorem expansion for negative
integer powers which is given by:


$$(x+y)^{-n} = \sum_{k=0}^{\infty} C_k^{-n}\, x^k y^{-n-k} = \sum_{k=0}^{\infty} (-1)^k\, C_k^{n+k-1}\, x^k y^{-n-k} \qquad (|x/y| < 1) \qquad (35)$$


[45] For example, the circular permutations abc and acb can be plotted as: [figure not reproduced here]

For example, the well-known expansion of $(x+1)^{-n}$ for $|x| < 1$ is given by:


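$$(x+1)^{-n} = 1 - nx + \frac{n(n+1)}{2!}x^2 - \frac{n(n+1)(n+2)}{3!}x^3 + \frac{n(n+1)(n+2)(n+3)}{4!}x^4 - \cdots \qquad (36)$$

If we compare the coefficients in Eqs. 35 and 36 we see:

$$C_0^{n+0-1} = 1 \qquad C_1^{n+1-1} = n \qquad C_2^{n+2-1} = \frac{n(n+1)}{2!}$$

$$C_3^{n+3-1} = \frac{n(n+1)(n+2)}{3!} \qquad C_4^{n+4-1} = \frac{n(n+1)(n+2)(n+3)}{4!}$$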


And this general pattern continues. 
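
To see the expansion of Eq. 35 at work numerically, its partial sums can be accumulated term by term. The following C++ sketch is only an illustration under stated assumptions (the function name `negativeBinomialSeries` and the parameter `terms` are not from the book). It uses the fact that consecutive coefficients satisfy $C_{k+1}^{-n} / C_k^{-n} = -(n+k)/(k+1)$, so no factorial has to be evaluated explicitly:

```cpp
#include <cmath>

// Partial sum (first `terms` terms) of the expansion
//   (x + y)^(-n) = sum over k of C_k^{-n} x^k y^(-n-k),
// which converges for |x/y| < 1. Here n is a positive integer.
double negativeBinomialSeries(double x, double y, int n, int terms) {
    double term = std::pow(y, -n);  // k = 0 term: y^(-n)
    double sum = 0.0;
    for (int k = 0; k < terms; ++k) {
        sum += term;
        // Next term: multiply by C_{k+1}^{-n} / C_k^{-n} = -(n + k)/(k + 1)
        // and by one more factor of x/y.
        term *= -static_cast<double>(n + k) / (k + 1) * (x / y);
    }
    return sum;
}
```

With a sufficient number of terms, `negativeBinomialSeries(0.3, 1.0, 4, 200)` should approach $1.3^{-4}$. The closer |x/y| is to 1, the more terms are needed for a given accuracy.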

PE: By implementing the binomial theorem expansion for negative integer powers (according to the 
above-given form) in a spreadsheet (or in a computer code or in a script of a mathematical software 
package) show that the expansions of the following binomial expressions converge to the indicated 
values (where convergence is symbolized by —). 


(a) $(0.3 + 1)^{-4} \to 1.3^{-4}$. (b) $(-0.15 + 1)^{-29} \to 0.85^{-29}$. (c) $(3.1 + 4.6)^{-7} \to 7.7^{-7}$.
(d) $(-1.4 + 2.1)^{-11} \to 0.7^{-11}$. (e) $(4.1 - 6.8)^{-8} \to (-2.7)^{-8}$. (f) $(-5.9 - 8.4)^{-9} \to (-14.3)^{-9}$.


2.3 Dealing with Very Large and Very Small Numbers


It is common in probability theory (and related subjects like counting methods and combinatorics
in general) to face problems involving very large and/or very small numbers which only a few calculators
(or compilers or mathematical software packages) can manage.[46] For example, the calculation of the
factorial of 2369 or the value of 0.5°47° cannot be done by ordinary means like calculating the factorial 
of 15 or the value of 0.5°. In fact, in some cases even if a calculator that can manage this is available it 
could be difficult to make use of it due to practical and procedural issues. So, being able to deal with 
this sort of extreme calculations independently and by our simple means (and hence be less dependent 
on external help such as specialized software packages which may not be available or not usable) is a big 
advantage and can even be a necessity in some situations. 

It is worth noting that problems and difficulties of this sort have no easy solution or fix in general, and
hence we need to invent and tailor innovative methods that can deal with these problems case by case.
In the following subsections we will outline some elementary methods that can be used to overcome (or
at least mitigate the severity of) these difficulties. However, we should stress that these problems and
difficulties could require (in some cases) exceptional and non-elementary methods if they are solvable at
all.


2.3.1 Use of Calculational Tricks 


There are many calculational tricks (which are usually case specific) that can be used for overcoming the 
difficulties of extreme calculations. A particularly useful trick in the calculations of probability is what we 
call “splitting method” which can be used when we have calculations involving products or/and quotients 
of too many factors some of which are small and some are large. In such cases we split these factors in 
small groups each of which involves some small factors and some large factors and the results obtained 
from calculating these groups are then processed easily (e.g. by multiplying or/and dividing them) to 
obtain the final result. For example, it is common in the probability theory to face probabilities given by 


[46] We should note that dealing with very large and very small numbers (which is the subject and title of this section) should 
be seen as a typical (and more common) example for calculational difficulties encountered in the field of probability (and 
related subjects). So, the proposed methods in the following subsections should be more general in their benefits for 
easing and overcoming the difficulties encountered in this field. Moreover, other methods for easing and overcoming these 
difficulties could be proposed in this regard. In fact, it may be more appropriate to title this section with something like 
‘Dealing with Calculational Difficulties”. However, we preferred to title it according to the most common calculational 
difficulties encountered in this field (considering our limited needs in this book and avoiding the obligation to go beyond 
our scope if we choose a more general title). 
this type of formula:

$$C^{1321}_{469} \times 0.5^{1321}$$

As we see, neither $C^{1321}_{469}$ nor $0.5^{1321}$ is easy to calculate because the first is too big (and hence it is seen
as infinity by the ordinary calculators) while the second is too small (and hence it is rounded to zero by
the ordinary calculators) and this makes it impossible to calculate this formula as a product of $C^{1321}_{469}$ and
$0.5^{1321}$. However, we can put this formula in the following form which can be easily calculated (e.g. by
using a spreadsheet):

$$\begin{aligned}
C^{1321}_{469} \times 0.5^{1321} &= \frac{1321!}{469!\,(1321-469)!} \times 0.5^{1321} \\
&= \frac{1321 \times 1320 \times \cdots \times 853}{469!} \times 0.5^{1321} \\
&= \frac{1321 \times 1320 \times \cdots \times 853}{469!} \times \left(0.5^{3}\right)^{440} \times 0.5 \\
&= \frac{1321 \times 1320 \times \cdots \times 853}{469!} \times 0.125^{440} \times 0.5 \\
&= \frac{1321 \times 0.125}{469} \times \frac{1320 \times 0.125}{468} \times \cdots \times \frac{883 \times 0.125}{31} \times \frac{882 \times 0.125}{30} \\
&\quad \times \frac{881}{29} \times \frac{880}{28} \times \cdots \times \frac{853}{1} \times 0.5 \\
&\simeq 0.35208 \times 0.35256 \times \cdots \times 3.56048 \times 3.675 \times 30.37931 \times 31.42857 \times \cdots \times 853 \times 0.5 \\
&\simeq 7.9 \times 10^{-27}
\end{aligned}$$
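
The same pairing idea can be sketched in code. The following C++ fragment is an illustration only, not one of the book's codes, and the function name `binomialTimesPower` is an assumption. Instead of the fixed grouping used above (0.125 attached to the first 440 ratios), it attaches factors of q adaptively: whenever the running product exceeds 1 it is multiplied by q, so the intermediate values stay within the range of ordinary double-precision numbers:

```cpp
#include <cmath>

// Evaluates C^n_k * q^n for 0 <= k <= n and 0 < q < 1 by interleaving the
// large ratios (n - i)/(k - i) with the small factors q.
double binomialTimesPower(int n, int k, double q) {
    double product = 1.0;
    int remaining = n;  // number of factors q not yet used
    for (int i = 0; i < k; ++i) {
        // One ratio of C^n_k = [n/k] * [(n-1)/(k-1)] * ... * [(n-k+1)/1].
        product *= static_cast<double>(n - i) / (k - i);
        // Pull the running product back down with the small factors.
        while (product > 1.0 && remaining > 0) {
            product *= q;
            --remaining;
        }
    }
    // Apply whatever factors of q are left.
    return product * std::pow(q, remaining);
}
```

Under these assumptions, `binomialTimesPower(1321, 469, 0.5)` should give a value close to the $7.9 \times 10^{-27}$ obtained above.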


Problems 


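1. Find the probability of a given event which is given by $P = C^{2375}_{1407}\, 0.6^{1407}\, 0.4^{968}$.
Answer: Using the splitting method we have:

$$\begin{aligned}
P &= C^{2375}_{1407} \times 0.6^{1407} \times 0.4^{968} \\
&= \frac{2375!}{1407!\,(2375-1407)!} \times 0.6^{1407} \times 0.4^{968} \\
&= \frac{2375 \times 2374 \times \cdots \times 969}{1407!} \times 0.6^{1407} \times 0.4^{968} \\
&= \frac{2375 \times 2374 \times \cdots \times 969}{1407!} \times 0.6^{968} \times 0.6^{439} \times 0.4^{968} \\
&= \frac{2375 \times 2374 \times \cdots \times 969}{1407!} \times 0.24^{968} \times 0.6^{439} \\
&= \frac{2375 \times 0.24}{1407} \times \frac{2374 \times 0.24}{1406} \times \cdots \times \frac{1409 \times 0.24}{441} \times \frac{1408 \times 0.24}{440} \\
&\quad \times \frac{1407 \times 0.6}{439} \times \frac{1406 \times 0.6}{438} \times \cdots \times \frac{970 \times 0.6}{2} \times \frac{969 \times 0.6}{1} \\
&\simeq 0.40512 \times 0.40523 \times \cdots \times 0.76680 \times 0.768 \times 1.92301 \times 1.92603 \times \cdots \times 291 \times 581.4 \\
&\simeq 0.012544
\end{aligned}$$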


PE: Find the probability of a given event which is given by P = C?$/2 0.611497 0.399. 

2. Calculate the following factorials: 
(a) 235!. (b) 569!. (c) 873!. (d) 1382!. (e) 1948!. (f) 2634!. 
Answer: For only very few practical purposes is the exact integer value required. This is because
there are too many digits in the number, which makes it almost useless in its exact integer format. So,
in the overwhelming majority of applications (especially in science) what is required is an approximate
value that gives the right order of magnitude of the factorial in a fractional scientific format,[47] and
this is what we will do here. Using the laws of logarithms, we have:

$$\log(n!) = \log(1) + \log(2) + \cdots + \log(n-1) + \log(n) = \sum_{i=1}^{n} \log(i)$$

[47] In fact, this format is what we usually need in the calculations of probabilities.



This sum can be easily calculated with a humble calculator (e.g. a spreadsheet) and the result is 
obtained straightforwardly. On using logarithm to the base 10 we get: 


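(a) $\sum_{i=1}^{235} \log(i) \simeq 456.726522256951$. (b) $\sum_{i=1}^{569} \log(i) \simeq 1322.32202904815$.
(c) $\sum_{i=1}^{873} \log(i) \simeq 2190.23599056559$. (d) $\sum_{i=1}^{1382} \log(i) \simeq 3741.95651163172$.
(e) $\sum_{i=1}^{1948} \log(i) \simeq 5564.15753179453$. (f) $\sum_{i=1}^{2634} \log(i) \simeq 7868.07968605461$.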
Therefore: 


(a) $235! = 10^{\log(235!)} \simeq 10^{456.726522256951} = 10^{0.726522256951} \times 10^{456} \simeq 5.32748526199126 \times 10^{456}$

(b) $569! = 10^{\log(569!)} \simeq 10^{1322.32202904815} = 10^{0.32202904815} \times 10^{1322} \simeq 2.09908027765868 \times 10^{1322}$
(c) $873! = 10^{\log(873!)} \simeq 10^{2190.23599056559} = 10^{0.23599056559} \times 10^{2190} \simeq 1.72183117031237 \times 10^{2190}$
(d) $1382! = 10^{\log(1382!)} \simeq 10^{3741.95651163172} = 10^{0.95651163172} \times 10^{3741} \simeq 9.04714668414713 \times 10^{3741}$
(e) $1948! = 10^{\log(1948!)} \simeq 10^{5564.15753179453} = 10^{0.15753179453} \times 10^{5564} \simeq 1.43724826990598 \times 10^{5564}$
(f) $2634! = 10^{\log(2634!)} \simeq 10^{7868.07968605461} = 10^{0.07968605461} \times 10^{7868} \simeq 1.20139564857308 \times 10^{7868}$
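
The two steps used in this answer (summing the logarithms, then splitting the result into an integer exponent and a fractional part) can be sketched in C++ as follows. This is a minimal illustration, not the Factorial.cpp code referred to in § 2.3.3; the names `Scientific`, `fromLog10` and `log10Factorial` are assumptions made here:

```cpp
#include <cmath>

// A number written as mantissa x 10^exponent with 1 <= mantissa < 10.
struct Scientific {
    double mantissa;
    long exponent;
};

// Converts a base-10 logarithm into the scientific format above.
Scientific fromLog10(double log10Value) {
    double whole = std::floor(log10Value);  // integer part (the exponent)
    return {std::pow(10.0, log10Value - whole), static_cast<long>(whole)};
}

// log10(n!) computed as log10(1) + log10(2) + ... + log10(n).
double log10Factorial(int n) {
    double sum = 0.0;
    for (int i = 2; i <= n; ++i) {  // log10(1) = 0, so the loop starts at 2
        sum += std::log10(i);
    }
    return sum;
}
```

For instance, `fromLog10(log10Factorial(235))` should give an exponent of 456 and a mantissa close to 5.3275, in agreement with part (a).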


PE: Calculate the following factorials (approximately): 


(a) 297!. (b) 473!. (c) 701!. (d) 1391!. (e) 1883!. (f) 2564!.
3. Calculate the following permutations:[48]
(a) P53”. (b) Piss: (c) Poss”. (d) Pizea- 


Answer: Again, for only very few practical purposes the exact integer value is required. So, we repeat
our argument and approach in Problem 2 and hence calculate an approximate value that gives the
order of magnitude in a scientific format. Using the laws of logarithms, we have:

$$\log\left(P^n_k\right) = \log\left[\frac{n!}{(n-k)!}\right] = \log\left[n \times (n-1) \times \cdots \times (n-k+1)\right] = \sum_{i=n-k+1}^{n} \log(i)$$


This sum can be easily calculated (by a spreadsheet or a computer code for instance). On using 
logarithm to the base 10 we get: 

(a) P2388 = polos(Ps2°) w 4Q120.751443959533 ___ 1 90.751443959533 . 49120. 5 4914131616555 x 10!29 

(b) PSS = poles(Prss) ~ 49440-237515517598 __ 1 90.237515517598 . 49440 ~ 4.79788771785137 x 10449 

(c) Pd20 — 19!0s(Pss2") ~ 192893-061957369540 __ 4 90.061957369540 Y 492893 S 1 15334004011238 x 102893 
(d) P'875 = 19!0s(Pizc2) ~ 493717-479122616300 _ 4 90.479122616300 . 493717 ~ 3 91385681969706 x 103717 


PE: Calculate the following permutations (approximately): 


(a) Poe" (b) Pro3. (c) Pri’. (d) Piss. 
4. Calculate the following binomial coefficients: 
(a) Cis3”. (b) C755. (c) Cii3s- (d) Cizot. 


Answer: If we repeat the argument and method of the previous Problems then we have:

$$\log\left(C^n_k\right) = \log\left[\frac{n!}{k!\,(n-k)!}\right] = \log\left[\frac{n \times (n-1) \times \cdots \times (n-k+1)}{k!}\right] = \sum_{i=n-k+1}^{n} \log(i) - \sum_{i=1}^{k} \log(i)$$

[48] "Permutations" and its alike (e.g. combinations) in such context should mean "number of permutations".

These sums can be easily calculated (by a spreadsheet or a computer code for instance) and their
difference is obtained. On using logarithm to the base 10 we get:


(a) Ces 19!08(Cas3” ) ~ 19319-592532861485 = 1909:892532861485 x 109319 ~ 3.91320735824353 x 109319 


C1492 


Gas = 1gles 769 ) ~ 10447-143929164639 = 1.90:143929164639 x 10447 ~ 1.39292959140671 x 10447 
~ 10478-277091080755 a 109:277091080755 x 10478 ~ 1.89274052481793 x 10478 


1438 *) 
ia02) ~ 19825-190214220116 = 190:190214220116 x 10825 ~ 1.54958077671787 x 19825 


( 
(71937 — jgles(Ci 
d) Cig = solee(° 


PE: Calculate the following binomial coefficients (approximately): 


(a) C538”. (b) C359°. (c) C7g52- (d) Ci333- 
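
The logarithmic sums used in Problems 3 and 4 translate directly into short loops. The following C++ sketch is an illustration under stated assumptions (the names `log10Permutations` and `log10Binomial` are not taken from the book's Permutation.cpp or BinomialCoefficient.cpp codes). It returns the base-10 logarithm, from which the scientific format follows as in Problem 2:

```cpp
#include <cmath>

// log10 of P^n_k = n x (n-1) x ... x (n-k+1), for 0 <= k <= n.
double log10Permutations(int n, int k) {
    double sum = 0.0;
    for (int i = n - k + 1; i <= n; ++i) {
        sum += std::log10(i);
    }
    return sum;
}

// log10 of C^n_k = P^n_k / k!, for 0 <= k <= n.
double log10Binomial(int n, int k) {
    double sum = log10Permutations(n, k);
    for (int i = 2; i <= k; ++i) {  // subtract log10(k!)
        sum -= std::log10(i);
    }
    return sum;
}
```

A simple sanity check is to compare small cases with known values, for example $P^{10}_5 = 30240$ and $C^{10}_5 = 252$ from Problem 38 of § 2.2.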
5. Calculate the following multinomial coefficients: 
(a) Ch5.6,19- (b) C342,16,27,29" (c) CIT 5,22,36,44,65° (d) C454,77,97,103,137" 


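Answer: If we follow the argument and method of the previous Problems then we have:

$$\begin{aligned}
\log\left(C^n_{n_1,n_2,\ldots,n_k}\right) &= \log\left[\frac{n!}{n_1!\,n_2! \cdots n_k!}\right] \\
&= \log(n!) - \log(n_1!) - \log(n_2!) - \cdots - \log(n_k!) \\
&= \sum_{i=1}^{n}\log(i) - \sum_{i=1}^{n_1}\log(i) - \sum_{i=1}^{n_2}\log(i) - \cdots - \sum_{i=1}^{n_k}\log(i) \\
&= \sum_{i=1}^{n}\log(i) - \sum_{j=1}^{k}\left(\sum_{i=1}^{n_j}\log(i)\right)
\end{aligned}$$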


These sums can be easily calculated (e.g. by a spreadsheet or a computer code) and their algebraic 
sum is obtained. On using logarithm to the base 10 we get: 


meres = 10!8(C2%5,6.19) re 199-0683449986 . 1915 ~ 117042879714 x 1015 

(b) C8 o 169729 = 10!8(CSa10.27.20) —w 199-5611911824 , 1950 ~ 3 64075271761 x 10° 

(c) CB 5 99.36.44,65 = 10!°8(CHis.22.06,44,05) a. 19-155895067 , 19131 ~ 143184189946 x 1019! 

(d) CFE, 7797,103,137 = 1018(Ca4.77.07.108,197) wv 190-265827326 19322 ~ 184915990731 x 10°”? 
PE: Calculate the following multinomial coefficients (approximately): 

(a ) CE ,6,6,7° (b) CP 18,09,31° (c) C3353, 34,56,61,77° (d) C3%, 71,84,91,99" 

6. Calculate the following powers: 
(a) $0.45^{1693}$. (b) $0.71^{3712}$. (c) $1236^{2019}$. (d) $6391^{3361}$.


Answer: We have:

$$\log\left(x^y\right) = y \log(x)$$

So, on using logarithm to the base 10 we get:
(a) $0.45^{1693} = 10^{1693 \log(0.45)} \simeq 10^{-587.111214178343} \simeq 7.74079955396 \times 10^{-588}$
(b) $0.71^{3712} = 10^{3712 \log(0.71)} \simeq 10^{-552.129009554793} \simeq 7.43002791118 \times 10^{-553}$
(c) $1236^{2019} = 10^{2019 \log(1236)} \simeq 10^{6242.785292449900} \simeq 6.09947491993 \times 10^{6242}$
(d) $6391^{3361} = 10^{3361 \log(6391)} \simeq 10^{12790.5167957807} \simeq 3.28697030560 \times 10^{12790}$
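
The rule $\log(x^y) = y \log(x)$ can be coded in a few lines. The following C++ sketch is illustrative only (it is not the Power.cpp code of § 2.3.3, and the names `Scientific` and `powerScientific` are assumptions). Taking the floor of the logarithm, rather than truncating it, keeps the mantissa between 1 and 10 for negative logarithms as well, which is why part (a) ends with the exponent −588 and not −587:

```cpp
#include <cmath>

// A number written as mantissa x 10^exponent with 1 <= mantissa < 10.
struct Scientific {
    double mantissa;
    long exponent;
};

// x^y for x > 0 in scientific format, using log10(x^y) = y * log10(x).
Scientific powerScientific(double x, double y) {
    double log10Value = y * std::log10(x);
    // floor (not truncation) so that e.g. -587.11 splits into -588 + 0.89.
    double whole = std::floor(log10Value);
    return {std::pow(10.0, log10Value - whole), static_cast<long>(whole)};
}
```

For example, `powerScientific(0.45, 1693)` should give a mantissa close to 7.7408 with exponent −588, as in part (a).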


PE: Calculate the following powers (approximately): 
(a) 0.32981, (b) 0.85239, (c) 25161705, (d) 59284920, 
2.3.2 Use of Analytical Approximations


There are many analytical approximation rules and formulae (found in the literature of calculus for
instance) that can be used for overcoming the difficulties of extreme calculations. The best-known example
of these rules and formulae (which is widely used in probability theory) is the Stirling approximation
for factorials of large numbers which is given by:[49]


$$n! \simeq \sqrt{2\pi n}\; n^n e^{-n} \qquad \text{(large } n\text{)} \qquad (37)$$


Although this may not be of great help for calculating the factorial itself, it can be useful for simplifications
(e.g. by cancellation) when the factorial is involved in a formula.[50] We also note that $n^n e^{-n}$, which can
be written and calculated as $(n/e) \times (n/e) \times \cdots \times (n/e)$, may also help in easing the calculations (i.e. by
keeping the size of the involved numbers manageable and under control, as we did earlier in some of our
calculational tricks by pairing large factors with small factors).


Problems 


1. Plot a graph for the ratio of the Stirling approximation to the corresponding factorial and comment. 
Answer: See Figure 5. 
Comment: We note the following: 
• The approximation improves with increasing n. In fact, the ratio approaches 1 asymptotically from
below.
• The approximation is good even for low n, and hence the "large n" restriction is to ensure greater
accuracy and to indicate that the approximation improves with increasing n.
• The Stirling approximation is always lower in value than the factorial, and this should be taken into
account when assessing results obtained by the Stirling approximation (e.g. when assessing the type
of errors introduced by this approximation).
PE: Investigate the use of Stirling’s approximation in the probability theory. 
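
The values behind a plot like the one requested in Problem 1 can be produced without ever forming n! itself. The following C++ sketch is an illustration under stated assumptions (the function name `stirlingRatio` is not from the book). It works with natural logarithms and uses the standard library function `std::lgamma`, relying on the identity $\ln(n!) = \ln\Gamma(n+1)$:

```cpp
#include <cmath>

// Ratio of Stirling's approximation sqrt(2*pi*n) * n^n * e^(-n) to n!,
// for n >= 1, evaluated through logarithms to avoid overflow.
double stirlingRatio(int n) {
    const double pi = std::acos(-1.0);
    double logStirling = 0.5 * std::log(2.0 * pi * n) + n * std::log(n) - n;
    double logFactorial = std::lgamma(n + 1.0);  // ln(n!) = ln(Gamma(n + 1))
    return std::exp(logStirling - logFactorial);
}
```

Evaluating `stirlingRatio(n)` for n = 1, 2, 3, ... should show the ratio starting at about 0.92 for n = 1 and rising towards 1 from below, in line with the comments above.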
Figure 5: The plot of the ratio of Stirling's approximation to the corresponding factorial. See Problem 1
of § 2.3.2.


[49] For more details about Stirling’s formula, the reader is referred to the textbooks of calculus. 

[50] For example, Stirling’s approximation (as given by the above formula) is widely used in analytical derivations and 
theoretical arguments (possibly much more than its use in practical calculations especially these days where computers 
usually overcome most practical difficulties). 
2.3.3 Use of Computing and Programming


Some problems (in probability and related subjects) require using computational methods and program- 
ming languages to solve (if they are solvable). This is when the use of normal calculators (or similar tools 
like spreadsheets) is impossible or impracticable. For example, calculating cumulative probability distri- 
butions involving factorials or binomial coefficients of large numbers may require designing and writing 
special codes or routines that employ relatively complicated computational methods and mathematical 
operations (noting that direct application of elementary operations like multiplication cannot do the job 
due to their limitations in application). We will see in the Problems of this subsection some simple exam- 
ples of the use of coding and programming to do some probability-related calculations. We will also meet 
in the future more such examples (see for instance Problem 12 of § 4.1.2, Problem 7 of § 4.1.3, Problem 
11 of § 4.1.4 and Problem 9 of § 4.2.2). 


Problems 


1. Write a simple program that can calculate the factorial of large numbers. 
Answer: See the Factorial.cpp code. 
PE: Plot a flowchart representing the algorithm of the Factorial.cpp code. 

2. Write a simple program that can calculate permutations involving large numbers. 
Answer: See the Permutation.cpp code. 
PE: Plot a flowchart representing the algorithm of the Permutation.cpp code. 

3. Write a simple program that can calculate binomial coefficients involving large numbers. 
Answer: See the BinomialCoefficient.cpp code. 
PE: Plot a flowchart representing the algorithm of the BinomialCoefficient.cpp code. 

4. Write a simple program that can calculate multinomial coefficients involving large numbers. 
Answer: See the MultinomialCoefficient.cpp code. 
PE: Plot a flowchart representing the algorithm of the MultinomialCoefficient.cpp code. 

5. Write a simple program that can calculate (non-negative integer) powers of positive numbers (small 
and large). 

Answer: See the Power.cpp code. 

PE: Plot a flowchart representing the algorithm of the Power.cpp code. Also explain why in the final 
stage (i.e. output stage) the code distinguishes between the case of negative logarithm and the case of 
non-negative logarithm. 


2.3.4 Use of Other Functions and Distributions 


Some functions and distributions can be approximated by other functions and distributions which are 
easier to calculate or estimate and may be the only possible or viable choice for solving the given problem. 
For example, the Poisson distribution (see § 4.1.4) can approximate the binomial distribution (see § 4.1.2) 
under certain conditions and hence in some circumstances we may use the Poisson distribution to solve 
a binomial distribution problem. Similarly, the normal distribution (see § 4.2.2) can approximate the 
binomial distribution under certain conditions and hence in some circumstances we may use the normal 
distribution to solve a binomial distribution problem. We will meet in the future more clarifications and 
examples about the use of functions and distributions to approximate other functions and distributions. 


Problems 


1. Give some examples (related to the probability theory) of approximations of functions and distribu- 
tions made by using other functions and distributions. 
Answer: For example: 
• The Poisson probability distribution is used (under certain conditions) to approximate the binomial
probability distribution (see § 4.1.4).
• The binomial probability distribution is used (under certain conditions) to approximate the
hypergeometric probability distribution (see Problem 5 of § 4.1.6).
• The normal distribution is used (under certain conditions) to approximate a number of other
probability distributions such as the binomial and Poisson distributions (see for example § 4.2.2).
• We may even consider the use of the Stirling formula to approximate factorials as another example
(see § 2.3.2 as well as Problem 39 of § 3.2).

The reader is referred to the subsections of § 6.2 for more details and examples. 

PE: Give more examples (related to the probability theory) of functions and distributions used to 
approximate other functions and distributions. 


2.4 Venn and Tree Diagrams 


Venn diagrams and tree diagrams are two visual abstract devices which are commonly used in the literature
of probability theory for the purpose of demonstration, illustration, simplification, and so on.[51] The Venn
diagram is strongly linked to the subject of sets (see § 2.1) where it employs graphic encirclement to show
the relation between sets as collections of elements.[52] On the other hand, the tree diagram is strongly linked
to the subject of counting (see § 2.2) and probability where it is used to demonstrate the branching of
various possibilities for the outcome of probabilistic trials and experiments. It is also used to demonstrate
and outline hierarchical structures (regardless of their link to probability) and for other similar purposes.

In the following Problems we give some examples of these devices in the context of sets and counting 
(which were investigated in § 2.1 and § 2.2). Further investigation of these devices (as well as other 
devices) in the context of probability will be given in § 3.4. 


Problems 


1. Describe tree diagram in some detail. 

Answer: A tree diagram is a graphic device used to list all the possibilities for the outcome of a series of
trials (or observations) where each trial (or observation) can take place in a finite number of ways.[53]
A tree diagram branches from left to right (or from top to bottom) starting usually in a single point (or
node) and branching consecutively to multiple points where from each point (except the end points) a
number of branches (equal to the number of possibilities that can emerge from that point) appear. The
end points of a tree diagram represent the ultimate outcomes of that series of trials (or observations)
where each end point is identified distinctively by the route that connects it to the start point.

It is noteworthy that tree diagram is commonly used to facilitate the enumeration and listing in a 
systematic way and hence reducing confusion and avoiding error. For example, if we want to list 
all the 3-digit permutations of the set {1,2,3,4} in an improvised and spontaneous way it may be 
confusing and prone to errors and mistakes, but if we do it through the use of a tree diagram (as we 
will do in Problem 7) then it will be easy and robust. 

PE: Describe Venn diagram in detail. 

2. A set U of (lower case) English letters contains the elements: c,k,t,r,o,p,j,s,d,w,z. A and B are 
two subsets of U where A contains the elements: c,s,p,z,k and B contains the elements: c,j,s,d,w. 
(a) Draw a Venn diagram representing these sets. 
(b) Use your Venn diagram to identify the following: $\bar{A}$, $\bar{B}$, $A \cap B$ and $A \cup B$. 
Answer: 
(a) The Venn diagram is given in Figure 6. 
(b) $\bar{A} = \{t,r,o,j,d,w\}$, $\bar{B} = \{k,t,r,o,p,z\}$, $A \cap B = \{c,s\}$, $A \cup B = \{c,k,p,j,s,d,w,z\}$. 
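
The results of part (b) can also be cross-checked by a short code. The following is only a minimal sketch (not part of the original solution) which assumes that the sets are stored as `std::set<char>` and that B is {c,j,s,d,w} as implied by the answer; the names `CharSet`, `complementOf`, `intersectionOf` and `unionOf` are hypothetical helpers introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

using CharSet = std::set<char>;   // hypothetical alias: a set of letters

// Complement of s relative to the universal set u (s must be a subset of u).
CharSet complementOf(const CharSet& u, const CharSet& s) {
    assert(std::includes(u.begin(), u.end(), s.begin(), s.end()));
    CharSet out;
    std::set_difference(u.begin(), u.end(), s.begin(), s.end(),
                        std::inserter(out, out.begin()));
    return out;
}

// Elements that belong to both a and b.
CharSet intersectionOf(const CharSet& a, const CharSet& b) {
    CharSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// Elements that belong to a or b (or both).
CharSet unionOf(const CharSet& a, const CharSet& b) {
    CharSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.begin()));
    return out;
}
```

For instance, calling `complementOf(U, A)` with U and A initialized to the letters of this Problem should return the six letters t,r,o,j,d,w (stored in alphabetical order since `std::set` is sorted).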

Figure 6: Venn diagram of Problem 2 of § 2.4. 

[51] We already met an important use of Venn diagrams in proving identities of set theory (see Problem 21 of § 2.1). 

[52] Venn diagram is usually made of a rectangle (representing the universal set, or the sample space in the case of probability) 
inside which one or more closed curves (representing sets, or events in the case of probability) are sketched. These closed 
curves encircle the members (symbolized as points) of the sets which these curves represent and hence they identify 
these sets and demonstrate the relationships between them (e.g. if they are disjoint or not). 

[53] In fact, this description is appropriate in the context of probability (which is our prime interest in this book). As 
indicated earlier, tree diagram has more general uses and purposes than this (e.g. in demonstrating and outlining 
hierarchical structures). 

PE: Do the following: 
(a) Draw a Venn diagram representing the letters of “PROBABILITY THEORY” where the sets $\Omega$ 
and $\Lambda$ represent the letters of “PROBABILITY” and “THEORY” respectively. 
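(b) Use your Venn diagram to identify the following: $\bar{\Omega}$, $\bar{\Lambda}$, $\Omega \cap \Lambda$, $\Omega \cup \Lambda$, $\bar{\Omega} \cap \bar{\Lambda}$ and $\bar{\Omega} \cup \bar{\Lambda}$. 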
3. Use Venn diagrams to demonstrate intersection, union and complement of two sets (A and B). Also 
demonstrate disjoint sets. 
Answer: See Figure 7. 


Figure 7: Venn diagrams demonstrating intersection (upper left), union (upper right), complement (lower 
left) and disjoint sets (lower right). The shaded area in the first three frames represents the sets of interest 
(i.e. intersection, etc.). See Problem 3 of § 2.4. 


PE: Use Venn diagrams to demonstrate $A - B$ and $B - A$ for the following cases: 
(a) $A \cap B \neq \emptyset$ with $A \not\subseteq B$ and $B \not\subseteq A$. (b) A is a proper subset of B. (c) $A = B$. 
4. Prove or disprove the following relations using Venn diagrams: 

(a) $(A - B) \cup (B - A) = (A \cup B) - (A \cap B)$. (b) $A \cup (B - C) = (A - B) \cup (B - C)$. 

Answer: 


(a) In Figure 8 we constructed Venn diagrams (in stages) for the left hand side of this relation in the 
upper row and for the right hand side of this relation in the lower row. As we see, the Venn diagrams 
of the two sides are the same and hence this relation is correct. 

Figure 8: The Venn diagrams of part (a) of Problem 4 of § 2.4. 

(b) In Figure 9 we constructed Venn diagrams (in stages) for the left hand side of this relation in the 
upper row and for the right hand side of this relation in the lower row. As we see, the Venn diagrams 
of the two sides are not the same and hence this relation is incorrect (in general). 
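
The conclusions of parts (a) and (b) can also be cross-checked numerically (this is a complementary check and not a replacement for the Venn diagram method required by the Problem). The following minimal sketch assumes a small universal set whose subsets are encoded as bitmasks, and it tests each relation on every possible choice of subsets; the names `Mask`, `diff`, `relationAHolds` and `relationBHolds` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// A subset of a universal set with `bits` elements is encoded as a bitmask.
using Mask = std::uint32_t;

// Set difference x - y (elements of x that are not in y).
Mask diff(Mask x, Mask y) { return x & ~y; }

// True iff (A - B) U (B - A) == (A U B) - (A n B) for every pair of subsets.
bool relationAHolds(unsigned bits) {
    assert(bits > 0 && bits <= 8);
    const Mask count = Mask{1} << bits;
    for (Mask a = 0; a < count; ++a)
        for (Mask b = 0; b < count; ++b)
            if ((diff(a, b) | diff(b, a)) != diff(a | b, a & b)) return false;
    return true;
}

// True iff A U (B - C) == (A - B) U (B - C) for every triple of subsets.
bool relationBHolds(unsigned bits) {
    assert(bits > 0 && bits <= 8);
    const Mask count = Mask{1} << bits;
    for (Mask a = 0; a < count; ++a)
        for (Mask b = 0; b < count; ++b)
            for (Mask c = 0; c < count; ++c)
                if ((a | diff(b, c)) != (diff(a, b) | diff(b, c))) return false;
    return true;
}
```

We note that a failure of `relationBHolds` only needs one counterexample (e.g. A = B = C with a single element), while a success of `relationAHolds` on a small universal set is a check rather than a proof.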




Figure 9: The Venn diagrams of part (b) of Problem 4 of § 2.4. 


PE: Prove or disprove the following relations using Venn diagrams: 

(a) $(B \cap C) - A = (B - A) \cap (C - A)$. (b) $(B - A) \cup (A - B) = U$. 
5. A college has 3 main departments: mathematics, science and engineering. The mathematics department 
is divided into 2 branches: pure and applied. The science department is divided into 4 branches: 
physics, chemistry, biology and geology. The engineering department is divided into 4 branches: electronics, 
mechanics, communication and information. Make a tree diagram to demonstrate the structure of 
this college. 
Answer: See Figure 10. 

Figure 10: The tree diagram of Problem 5 of § 2.4. 

PE: Put the following in a simple tree diagram: fish, mammals, animals, birds, reptiles, cold-blooded, 
amphibians, warm-blooded. 

6. Make a simple tree diagram to demonstrate the finiteness/infiniteness and countability of sets. 
Answer: See Figure 11. 


Figure 11: The tree diagram of Problem 6 of § 2.4. 


PE: Put the following in a simple tree diagram: commutative, set operations, union, non-commutative, 
difference, intersection. 

7. List all the 3-digit permutations of the set {1,2,3,4} through the construction of a tree diagram. 
Answer: See Figure 12. As we see, the enumeration and listing of these permutations through the 
tree diagram make the job easy and less prone to error because it is visual and systematic. Also see 
Problem 1. 
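
The systematic branching of the tree diagram can also be imitated by a short recursive code. The following is a minimal sketch (under the assumption that the digits are given as the characters of a string) in which each level of recursion corresponds to one level of the tree and each unused digit corresponds to one branch; the names `extend` and `listPermutations` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Depth-first walk of the tree: each level appends one digit not used so far.
void extend(const std::string& digits, std::size_t length, std::string& path,
            std::vector<bool>& used, std::vector<std::string>& out) {
    if (path.size() == length) {   // an end point of the tree is reached
        out.push_back(path);
        return;
    }
    for (std::size_t i = 0; i < digits.size(); ++i) {
        if (used[i]) continue;     // no repetition inside a permutation
        used[i] = true;
        path.push_back(digits[i]);
        extend(digits, length, path, used, out);
        path.pop_back();           // go back to the branching point
        used[i] = false;
    }
}

// All permutations of the given length, listed in the order of the tree.
std::vector<std::string> listPermutations(const std::string& digits,
                                          std::size_t length) {
    assert(length <= digits.size());
    std::vector<std::string> out;
    std::string path;
    std::vector<bool> used(digits.size(), false);
    extend(digits, length, path, used, out);
    return out;
}
```

For example, `listPermutations("1234", 3)` should return 24 permutations starting with 123 and ending with 432, in agreement with the number of 3-digit permutations of a set of 4 digits, i.e. 4 × 3 × 2 = 24.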

PE: List all the 4-letter permutations of the set {a, b,c, d} through the construction of a tree diagram. 

8. A typical Venn diagram is generally divided into $2^n$ distinct regions (with regard to the inclusion and 
exclusion of members) where $n = 1, 2, \ldots$ is the number of sets represented by the diagram. Explain 
this. 

Answer: If the diagram contains only one set (i.e. $n = 1$) then we have only two distinct regions: 
either inside or outside (the closed curve representing) the set and hence we have $2^1$ distinct regions. 
If the diagram contains only two sets (i.e. $n = 2$) then we have four distinct regions: inside both 
sets, or outside both sets, or inside only the first set, or inside only the second set, and hence we have 
$2^2$ distinct regions. Each additional set splits every existing region into two parts (i.e. the part inside 
the new set and the part outside it), and hence by induction we can generalize this pattern and conclude 
that Venn diagram is generally divided into $2^n$ distinct regions. 

PE: Link this (i.e. having $2^n$ distinct regions in a typical Venn diagram) to the binomial theorem and 
to an identity investigated in Problem 25 of § 2.2. 

Figure 12: The tree diagram of Problem 7 of § 2.4. 

9. Classify the $2^n$ regions found in Problem 8 and find the number of regions in each type. 
Answer: We can classify these regions into $n + 1$ types: 
• Regions not under any set: we have $C^n_0 = 1$ region of this type. 
• Regions (each of which is) under exactly one set: we have $C^n_1 = n$ regions of this type. 
• Regions (each of which is) under exactly two sets: we have $C^n_2 = \frac{n(n-1)}{2}$ regions of this type. 
  ... 
• Regions (each of which is) under exactly $n - 1$ sets: we have $C^n_{n-1} = n$ regions of this type. 
• Regions (each of which is) under all sets: we have $C^n_n = 1$ region of this type. 
As a check, we have (see part j of Problem 25 of § 2.2): 

$$C^n_0 + C^n_1 + C^n_2 + \cdots + C^n_{n-1} + C^n_n = \sum_{k=0}^{n} C^n_k = 2^n$$



Note: as we will see (refer to § 3.2) events in the sample space of probability correspond to sets in the 
universal set. Accordingly, the above classification should apply to events in the sample space of any 
particular probability problem. 
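
The above classification can be checked by a short code. The following is a minimal sketch which assumes that each region is labeled by a binary code of n bits (where bit i is 1 if the region is inside set i and 0 otherwise); the name `regionsByType` is hypothetical and the limit of 20 sets is an arbitrary choice made to keep the loop small.

```cpp
#include <cassert>
#include <vector>

// counts[k] = number of regions that lie inside exactly k of the n sets.
std::vector<unsigned long> regionsByType(unsigned n) {
    assert(n >= 1 && n <= 20);
    std::vector<unsigned long> counts(n + 1, 0);
    const unsigned long total = 1UL << n;          // 2^n regions in total
    for (unsigned long region = 0; region < total; ++region) {
        unsigned inside = 0;                       // number of sets containing this region
        for (unsigned i = 0; i < n; ++i)
            if (region & (1UL << i)) ++inside;
        ++counts[inside];
    }
    return counts;
}
```

For n = 4 the returned counts should be 1, 4, 6, 4, 1 (i.e. the binomial coefficients) whose sum is 16 = 2^4.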

PE: Based on the above findings, try to make a link between Venn diagrams, binomial theorem and 
binomial coefficients. 


Chapter 3 
Mathematics of Probability 


In this chapter we present the basic mathematics of probability theory which we need and use in this book 
to solve probability problems. 


3.1 Axioms of Probability Theory 


The entire theory of probability can be constructed from a few axioms (or hypotheses). The number and 
content of these axioms may differ between authors. However, the following set of three axioms (with 
minor variations between authors) seems to be the most commonly used in recent times:[54] 

1. Probabilities are real numbers between 0 and 1, i.e. $0 \le P \le 1$. 

2. The probability of the sample space[55] is 1, i.e. $P(S) = 1$. 

3. The probabilities of disjoint events add, i.e. $P(A \cup B \cup \cdots) = P(A) + P(B) + \cdots$. 

In the following Problems we will briefly discuss the role of axioms in constructing a theory (in general) as 
well as some general aspects about the characteristics of axioms. Moreover, we will give a few examples 
for the role and use of the axioms of probability in the construction of probability theory in the next 
section (i.e. § 3.2). 

Problems 


1. Outline some of the roles that axioms can play in the construction and representation of a given theory. 
Answer: For example: 
• The axioms provide the starting points or propositions to derive the other propositions and premises 
of the theory. 
• The axioms provide the rational justification for the structure of the theory (in general) and its 
individual parts and components (in particular). 
• The axioms can play a role in organizing, classifying and structuring the given facts (such as 
observations) which make the body and content of the theory. 
• The axioms can reflect the spirit and basic essence of the theory and hence they can provide a simple 
and brief outline of the theory to be used for making general judgments and concise assessments. 
PE: Give more potential roles that axioms can play in the construction and representation of a given 
theory. 

2. Is there any problem in having more than one set of axioms for a given theory? 
Answer: Not at all. The purpose of any set of axioms is to construct the given theory and put it on 
a rational footing, and hence any set of axioms that can achieve this should be fine. In general, there 
is no unique way for constructing a theory and hence there is no unique set of axioms for constructing 
a specific theory. A given theory is like a house or a building which can be constructed in many 
different ways using different designs and/or construction materials, and all these ways fulfill 
the objective as long as they achieve and realize the required construction. 
PE: What can you conclude from the possibility of having more than one set of axioms for a given 
theory? 

3. Give some factors that favor/disfavor some sets of axioms against other sets of axioms for constructing 
a given theory. 
Answer: In our view, the main factors that should be considered in constructing or choosing a set of 
axioms to build a given theory are: 
• Clarity: it should be obvious that clarity is an advantage since it reduces the required work and 
minimizes confusion and mistakes, and hence clearer sets of axioms should be favored against less 
clear sets. 
• Simplicity: so the set of axioms that leads to the construction and derivation of the theory more 
easily and simply is favored in comparison to another set of axioms which is less simple and hence 
causes more complications and difficulties. 
• Intuitiveness: more intuitive sets of axioms should be favored against less intuitive sets. In fact, this 
factor can be included in the factors of clarity and simplicity. 
• Minimality: this means being less in number and more concise. So, in principle a set of axioms that 
is less in number and/or content and hence is more concise should be favored (from this perspective) 
against other sets. Again this is an advantage in general since it usually reduces the required work 
and enhances clarity and simplicity. 
• Richness: some sets of axioms may lead to results and consequences that cannot be obtained from 
other rival sets (although both sets can construct the main body of the given theory in general). So, 
richness should be considered as an advantageous factor in constructing and choosing a set of axioms 
for the construction of a given theory. 
PE: Give more potential factors that favor/disfavor some sets of axioms against rival sets of axioms 
for constructing a given theory. 

[54] We note that P symbolizes the probability (of a random occurrence or event) which is a single-valued real function of 
random variable (or variables). 
[55] As we will see in § 3.2, the sample space is the set of all possible outcomes of a given trial. 

4. Starting from different sets of axioms could lead to the same theory and could lead to different 
theories.[56] Comment on this. 
Answer: Some kinds of differences between sets of axioms do not affect the essence (i.e. they are 
virtually superficial) and hence they lead to the same theory, while other kinds of differences between 
sets of axioms do affect the essence (i.e. they are essential) and hence they lead to different theories. 
PE: Give some authentic examples (from mathematics or science) of different sets of axioms that lead 
to the same theory and other examples of different sets of axioms that lead to different theories. 


3.2 The Basics of Probability Theory 


In simple words, probability is a measure of the likelihood of something to occur (or not). It also 
reflects our partial ignorance of the surrounding conditions and circumstances which leads to our inability 
to predict events and occurrences definitely and deterministically (see Problem 1). Quantitatively, the 
probability is normally expressed as a real number between 0 and 1 (inclusive) where 0 represents certainty 
of non-occurrence and 1 represents certainty of occurrence. An experiment (or trial) in the context of 
probability theory is an action that produces one of a number of possible outcomes (see Problem 2).[57] 
The sample space is the set of all possible (individual) outcomes in a given trial, while an event is a 
subset of the sample space. The individual outcomes in the sample space are usually described as points 
of the sample space (or sample points). The elements (or points) of the sample space must be exhaustive 
(which we indicated by “all”) and mutually exclusive (which we indicated by “individual”). The individual 
points in the sample space may be equally likely (i.e. have the same probability) and may be not (where 
in the former/latter case the sample space is described as uniform/non-uniform). It is worth noting 
that the sample space can be discrete (made of countable points) and can be continuous (whose sample 
points constitute a continuum), and the discrete can be finite (made of finite number of sample points) 
and can be infinite (made of infinite number of sample points). We also note that the sample points may 
be described as elementary events or simple events because they cannot be split to simpler events. 

[56] For example, different sets of axioms (in the context of probability) lead to the same probability theory (i.e. in essence), 
while different sets of axioms (in the context of 2D geometry) lead to different geometries (e.g. Euclidean and 
non-Euclidean geometries). 

[57] We note that “experiment” and “trial” may be used differently, e.g. experiment is a series of trials. 

An event occurs if the outcome of the trial belongs to that event (i.e. the outcome is a member of the 
event). The probability of occurrence of a given event A in a given trial with a sample space of a finite 
number of equally likely sample points is given by: 

$$P(A) = \frac{N_A}{N_S} \qquad (N_A \ge 0,\ N_S > 0,\ N_A \le N_S) \tag{38}$$

where $P(A)$ is the probability of A, $N_A$ is the number of elements of A and $N_S$ is the number of elements 
of the sample space.[58] If the points of the sample space are not equally likely then the probability of 
an event A is the sum of the probabilities of all the sample points that belong to A.[59] So, in this case 
we need to define and assign beforehand (non-equal) probabilities to the individual points of the sample 
space (e.g. by experiment or guess or hypothesis) under the condition that their sum is 1 (as required by 
axiom 2 of probability theory; see § 3.1). 
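
The two ways of calculating the probability of an event (i.e. by counting in the case of equally likely sample points and by summation in the general case) can be sketched by the following code. This is only a minimal illustration which assumes that the sample points are labeled by integers; the names `uniformProbability` and `eventProbability` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>

// Eq. 38: P(A) = N_A / N_S for a finite sample space of equally likely points.
double uniformProbability(std::size_t nA, std::size_t nS) {
    assert(nS > 0 && nA <= nS);
    return static_cast<double>(nA) / static_cast<double>(nS);
}

// General case: P(A) is the sum of the probabilities of the sample points in A.
// pointProb maps each sample point to its (pre-assigned) probability.
double eventProbability(const std::map<int, double>& pointProb,
                        const std::set<int>& event) {
    double total = 0.0;
    for (int point : event) {
        const auto it = pointProb.find(point);
        assert(it != pointProb.end());   // the event must be a subset of the sample space
        total += it->second;
    }
    return total;
}
```

For example, for a fair die the probability of getting an even number is `uniformProbability(3, 6)` which is 0.5, while for a hypothetical loaded die with probabilities 0.1 for each of the faces 1 to 5 and 0.5 for the face 6 the probability of getting an even number is 0.1 + 0.1 + 0.5 = 0.7.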

An event is described as impossible (or empty) if it contains no element of the sample space and 
described as certain (or entire) if it contains all the elements of the sample space. The probability of the 
impossible event is 0 (i.e. it cannot occur) and the probability of the certain event is 1 (i.e. it must occur) 
while the probability of other types of event (which we may call probable event, i.e. neither impossible 
nor certain) is between 0 and 1 (i.e. it can occur but not necessarily). Accordingly, we can write: 


$$0 \le p \le 1 \qquad \text{or} \qquad 0 \le P \le 1 \tag{39}$$


where p and P stand for the probability of an event (say A). As indicated already, the sum of probabilities 
of all the sample points is equal to 1. 

Two (or more) events that make up the entire sample space and cannot occur simultaneously are called 
complementary events.[60] The probability of occurrence of complementary events (i.e. one of them 
unspecified) is 1 since they exhaust the sample space (noting that the probability of the sample space is 
1 as indicated already). Accordingly, we have: 

$$P(A) + P(\bar{A}) = 1 \tag{40}$$

where $A$ and $\bar{A}$ are complementary events. 
The probability of occurrence of “A AND B” is symbolized by $P(A \cap B)$ and the probability of occurrence 
of “A OR B” is symbolized by $P(A \cup B)$. We note that: 

$$P(A \cap B) = P(B \cap A) \qquad \text{and} \qquad P(A \cup B) = P(B \cup A) \tag{41}$$

This is because of the commutativity property of intersection and union of sets, i.e. $A \cap B = B \cap A$ and 
$A \cup B = B \cup A$ (see Eqs. 14 and 15). 

Two events are mutually exclusive (or disjoint or incompatible) if they cannot occur simultaneously 
(i.e. if one occurs the other cannot occur at the same time). Accordingly, if A and B are mutually exclusive 
events then: 

$$P(A \cap B) = 0 \tag{42}$$

According to the addition law of probability, if A and B are two events then the probability of 
occurrence of “A OR B” is: 

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{43}$$
The subtraction of $P(A \cap B)$ in this equation should be obvious because it is counted twice in $P(A) + P(B)$, 
i.e. once in $P(A)$ and once in $P(B)$ (also see part e of Problem 14). 
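
The addition law can be checked numerically by comparing the probability of the union (calculated directly) with the right hand side of Eq. 43. The following is a minimal sketch which assumes a uniform sample space whose points are labeled by integers (e.g. the faces of a die); the names `Event`, `unionOf`, `intersectionOf`, `probability` and `additionLaw` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

using Event = std::set<int>;   // an event is a set of sample points

Event unionOf(const Event& a, const Event& b) {
    Event out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.begin()));
    return out;
}

Event intersectionOf(const Event& a, const Event& b) {
    Event out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// P(E) = N_E / N_S for a uniform sample space (Eq. 38).
double probability(const Event& e, const Event& space) {
    assert(!space.empty());
    assert(std::includes(space.begin(), space.end(), e.begin(), e.end()));
    return static_cast<double>(e.size()) / static_cast<double>(space.size());
}

// Right hand side of Eq. 43: P(A) + P(B) - P(A n B).
double additionLaw(const Event& a, const Event& b, const Event& space) {
    return probability(a, space) + probability(b, space)
         - probability(intersectionOf(a, b), space);
}
```

For example, for throwing a die once with A = {2,4,6} and B = {1,2,3} the direct calculation gives P(A ∪ B) = 5/6 and the addition law gives 1/2 + 1/2 − 1/6 = 5/6 (the two values should agree up to the rounding errors of floating point arithmetic).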


According to the addition law for mutually exclusive events, if A and B are mutually exclusive 
events then the probability of occurrence of “A OR B” is: 


$$P(A \cup B) = P(A) + P(B) \qquad (A \cap B = \emptyset) \tag{44}$$

This equation can be obtained from Eq. 43 noting that for mutually exclusive events $P(A \cap B) = 0$ (see 
Eq. 42).[61] 

[58] In fact, this primarily represents what we may call the repetition or counting method for defining and quantifying 
probabilities. We should also note that being elements of the sample space also implies being exhaustive and mutually 
exclusive (as indicated already). Also see § 1.2 and § 1.3 as well as Problem 2. 

[59] In fact, this definition is general and hence it applies even to the case of equally likely sample points. 

[60] In fact, this is based on the definition of complement of event A (symbolized as $\bar{A}$) which is the set of outcomes (in the 
sample space) that do not belong to A (see § 2.1). 

The conditional probability $P(B|A)$ of event B given that event A occurred is: 

$$P(B|A) = \frac{P(A \cap B)}{P(A)} \qquad [P(A) \neq 0] \tag{45}$$


From this equation we get the following relation (which is the multiplication law for associated 
events): 

$$P(A \cap B) = P(A)\, P(B|A) \tag{46}$$
Now, if we exchange the symbols of A and B in the last equation then we get $P(B \cap A) = P(B)\, P(A|B)$. 
If we also note (see Eq. 41) that $P(A \cap B) = P(B \cap A)$ then we get the following relation: 

$$P(A \cap B) = P(A)\, P(B|A) = P(B)\, P(A|B) \tag{47}$$

As we will see (refer to § 6.1), this is the origin of the Bayes theorem. It is important to note that unlike 
$P(A \cup B)$ and $P(A \cap B)$ where the order of events does not matter (see Eq. 41), the order of events in 
the equation of conditional probability is important and hence in general we have: 

$$P(A|B) \neq P(B|A) \tag{48}$$

However, they are equal iff $P(A) = P(B)$ (assuming $P(A \cap B) \neq 0$) as can be seen from Eq. 47. 
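
The definition of conditional probability (Eq. 45) and the importance of the order of events (Eq. 48) can be illustrated by a short code. The following is a minimal sketch which assumes the uniform sample space of throwing a die twice (i.e. 36 equally likely ordered pairs) where events are represented by logical conditions on the two numbers; the names `Predicate`, `countOutcomes` and `conditionalProbability` are hypothetical.

```cpp
#include <cassert>
#include <functional>

// An event is represented by a condition on the ordered pair (first, second).
using Predicate = std::function<bool(int, int)>;

// Number of the 36 outcomes of two die throws that belong to the event.
int countOutcomes(const Predicate& event) {
    int count = 0;
    for (int a = 1; a <= 6; ++a)
        for (int b = 1; b <= 6; ++b)
            if (event(a, b)) ++count;
    return count;
}

// Eq. 45 for a uniform sample space: P(B|A) = P(A n B) / P(A) = N_{A n B} / N_A.
double conditionalProbability(const Predicate& b, const Predicate& a) {
    const int nA = countOutcomes(a);
    assert(nA > 0);   // Eq. 45 requires P(A) != 0
    const int nAB =
        countOutcomes([&](int x, int y) { return a(x, y) && b(x, y); });
    return static_cast<double>(nAB) / nA;
}
```

For example, if A is “the first number is 6” and B is “the sum is at least 11” then $N_A = 6$, $N_B = 3$ and $N_{A \cap B} = 2$, and hence $P(B|A) = 2/6$ while $P(A|B) = 2/3$ (i.e. they are different because $P(A) \neq P(B)$).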

Two (or more) events are independent if the occurrence (or not) of any one of them has no effect on 
the probability of occurrence (or not) of the other(s). In technical terms, event B is independent of event 
A iff 

$$P(B|A) = P(B) \tag{49}$$

Now, if we combine this equation with Eq. 46 we get the following relation for the probability of 
independent events: 

$$P(A \cap B) = P(A)\, P(B) \tag{50}$$

In fact, the last relation is the necessary and sufficient condition for the independence of two (non-empty) 
events, i.e. A and B are independent iff $P(A \cap B) = P(A)\, P(B)$ (see part d of Problem 14). As we see, 
the independence of two events means there is no involvement of conditional probability between them.[62] 
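
The condition of independence (Eq. 50) can be tested by a short code. The following is a minimal sketch which assumes the uniform sample space of throwing a die once, where an event is stored as a `std::bitset` (bit i is set if the face i + 1 belongs to the event). To avoid rounding errors, Eq. 50 is multiplied by $N_S^2$ so that only integers are compared. The names `kPoints`, `Event`, `fromFaces` and `areIndependent` are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>

constexpr std::size_t kPoints = 6;      // number of sample points (faces of a die)
using Event = std::bitset<kPoints>;     // bit i set <=> face i+1 belongs to the event

// Builds an event from a list of faces, e.g. fromFaces({2, 4, 6}).
Event fromFaces(std::initializer_list<int> faces) {
    Event e;
    for (int f : faces) {
        assert(f >= 1 && f <= static_cast<int>(kPoints));
        e.set(static_cast<std::size_t>(f - 1));
    }
    return e;
}

// Eq. 50 multiplied by N_S^2:  N_{A n B} * N_S == N_A * N_B.
bool areIndependent(const Event& a, const Event& b) {
    return (a & b).count() * kPoints == a.count() * b.count();
}
```

For example, the events {2,4,6} and {1,2} are independent (since 1 × 6 = 3 × 2) while the disjoint events {1,2} and {3,4} are not independent (since 0 × 6 ≠ 2 × 2), which demonstrates the difference between “independent” and “disjoint”.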


Problems 


1. Comment on the above statement: “It also reflects our partial ignorance of the surrounding conditions 
and circumstances which leads to our inability to predict events and occurrences definitely and deter- 
ministically”. 

Answer: We note the following: 

• This statement is true classically but not necessarily in general, e.g. the probability at the quantum 
level may be intrinsic to the phenomena and not because of our partial ignorance of the surrounding 
conditions and circumstances. 

• Probability “reflects our partial ignorance” when attributed to specific events and occurrences in the 
presence of an observer, but in general (considering repetitive trials in similar circumstances with no 
consideration of observer) it reflects in a sense indeterminism in reality either because the reality itself 
is not determined or because the reality is not completely defined and identified (i.e. it is surrounded 
by an element of ambiguity or generality) to be completely determined. The readers are also advised 
to read § 1.3 carefully to avoid potential misunderstanding and confusion.[63] 

PE: Compare (and hence justify the seeming difference) between the above statement and what we 
presented in § 1.3 about the distinction between subjective and objective probabilities. 


[61] In fact, if we consider the axioms of the probability theory (which we investigated in § 3.1) then this is an axiom rather 
than being derived or obtained from Eq. 43. Actually, this is axiom 3 which is used in the derivation of Eq. 43 (see part 
e of Problem 14). 

[62] It is important to distinguish between “independent” and “disjoint” events (noting that “disjoint” could wrongly suggest 
being independent). Also see Problem 29. 

[63] Some of these issues are investigated in our book “The Epistemology of Quantum Physics”. 

2. Give more details about the outcomes considered in the trials of probability theory and their nature 
and conditions. 

Answer: In general, we assume the following conditions about the outcomes in the trials of probability 
theory: 

• The outcomes should be well defined and reproducible, i.e. the experiment will generate the same 
outcomes if repeated. 

• Each one of the outcomes should have a definite and fixed (initial) probability and hence if the 
experiment is repeated then the outcomes will be produced with the same presumed probabilities (or 
frequencies). 

• The outcomes should be produced individually in each single trial (as indicated in the text by 
“produces one”). 

Note: the purpose of some of these conditions is to include (non-deterministic) random occurrences 
and phenomena and exclude (non-deterministic) haphazard occurrences and phenomena (assuming the 
existence of such occurrences and phenomena) which do not follow any deterministic statistical pattern 
or regularity (unlike random ones which follow such pattern and regularity). We should also note that 
“experiment” could also include “observation”. 

PE: Investigate thoroughly the difference between random events and haphazard events, and hence 
explain why haphazard events (unlike random events) are not subject to the probability theory. 

3. Is the sample space of a given trial unique? 

Answer: In general, the sample space of a given trial is not unique since it depends on the involved 
considerations and categorizations and the intended observations and objectives in the given trial 
(and this appears more vividly and naturally in the cases where the trial has more than 2 possible 
outcomes). For example, the sample space of throwing a die once could be getting one of the numbers 
{1,2,3,4,5,6} or {getting a number $\le 3$, getting a number $> 3$} among many other possibilities 
(restricted by the condition that the sum of the probabilities of the points of the sample space must 
always be 1 as indicated earlier; see axiom 2 in § 3.1). However, a single possibility (or more) for the 
identification of the sample space is usually necessitated or favored by the nature of the problem and 
its requirements and objectives. For instance, if the objective of the experiment in the above example 
is to obtain the numbers on the faces of the die then the sample space is {1,2,3,4,5,6} (see Problem 
9), while if the objective of the experiment in the above example is to obtain a winning number in a 
game of luck (where the winning number is set to be $\le 3$) then the sample space is {getting a number 
$\le 3$, getting a number $> 3$}. Accordingly, in the Problems and examples of this book we generally 
assume and adopt a single sample space determined uniquely by the given conditions and requirements 
and the intended objectives of these Problems and examples. 

PE: Is it correct to say: given all the surrounding conditions and the intended objectives the sample 
space of a given trial is unique? If this is true, what about the labeling of the sample space and its 
points? 

4. Discuss briefly how initial probabilities are attached to the discrete and continuous sample spaces. 
Answer: In the case of discrete sample space the initial probabilities are attached to the points of the 
sample space directly subject to the condition that their sum is unity.[64] In the case of continuous 
sample space the initial probabilities are attached to infinitesimal intervals of the sample space in the 
form of a distribution function subject to the condition that the integral of the distribution function 
over the entire sample space is unity. 

Note: for simplicity, the definitions and formulations that we gave in the text of this section are 
mainly phrased and presented in the terms and forms of discrete sample space. However, they are 
more general and should be understood so. 

PE: Can the sequence $1/2^n$ ($n = 1, 2, \ldots$) represent the initial probabilities of an infinite discrete 
sample space? If so, can you give an example of a sample space whose initial probabilities (i.e. the 
probabilities of its points) are given by this sequence? 

[64] It should be noted that this applies to infinite sample space as to finite sample space noting that the series of probabilities 
in the former case must be convergent (as demanded by being normalized to unity which we indicated already). 

5. Referring to § 1.2 and Problem 4 of the present section, identify the relation between initial and 
ultimate probabilities as well as their relation to the sample space. 

Answer: In a discrete sample space, each sample point is given a specific initial probability (where 
the sum of all these initial probabilities is unity). The ultimate probability of any event (representing 
a number of sample points) is the sum of the initial probabilities of its sample points. 

In a continuous sample space, each sample (infinitesimal) interval is given a specific initial probability 
(where the integral of all these initial probabilities is unity). The ultimate probability of any event 
(representing an interval) is the integral of the initial probabilities over that interval.[65] 

Note: as indicated earlier, the initial probabilities are not required to be equal. We also note that 
initial probabilities are probabilities and hence they satisfy the condition $0 \le p \le 1$. However, if we 
assume that a sample point (or interval) can have a probability of 0 then it should be redundant in 
the calculations of probabilities and hence it can be eliminated from the sample space. 

PE: Give a few examples of sample spaces with their initial probabilities (attached to their sample 
points) and ultimate probabilities (assigned to possible events). Consider both discrete and continuous 
cases. 

6. What is the sample space for tossing a coin (considering the order in the case of multiple tossing): 
(a) One time. (b) Two times. (c) Three times. 
Answer: We symbolize sample space, head and tail with S, H and T respectively. 
(a) S = {H, T}. 
(b) S = {HH, HT, TH, TT}. 
(c) S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. 

PE: What is the sample space for tossing a coin four times? 
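
The listing of the sample space for any number of tosses can be automated by a short code. The following is a minimal sketch which assumes that each outcome is labeled by a binary code (0 for H and 1 for T, read from left to right) so that counting from 0 to $2^n - 1$ generates all the $2^n$ outcomes in the same order used in the above answer; the name `coinSampleSpace` is hypothetical and the limit of 20 tosses is an arbitrary choice.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Ordered outcomes of `tosses` coin tosses, e.g. {"HH", "HT", "TH", "TT"} for 2 tosses.
std::vector<std::string> coinSampleSpace(unsigned tosses) {
    assert(tosses >= 1 && tosses <= 20);
    std::vector<std::string> space;
    const unsigned long total = 1UL << tosses;        // 2^tosses sample points
    for (unsigned long code = 0; code < total; ++code) {
        std::string outcome;
        for (unsigned i = tosses; i-- > 0;)            // most significant bit first
            outcome += ((code >> i) & 1UL) ? 'T' : 'H';
        space.push_back(outcome);
    }
    return space;
}
```

For example, `coinSampleSpace(3)` should return the 8 sample points of part (c) in the given order.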

7. Give an example of a (discrete) sample space with an infinite number of sample points. 

Answer: Consider the experiment of throwing a die and counting the number of throws until “1” is 
obtained. The sample space of this experiment is: S = {1,2,3,...} which has an infinite number of 
sample points. This is because the required number of throws to get “1” can be any positive integer 
(i.e. we may get “1” in the first throw or in the second throw or ... etc.). 

PE: Give another example of a (discrete) sample space with an infinite number of sample points. 

8. Give examples of discrete and continuous sample space. 

Answer: The sample spaces of Problem 6 are discrete (finite). The sample space of Problem 7 is 
discrete (infinite). The sample space representing the lifetime of an excited state of an atom (or the 
lifetime of a radioactive nucleus) is continuous. 

PE: Give more examples of discrete and continuous sample space (both finite and infinite). 

9. What is the sample space S for the numbers obtained when throwing a die (considering the order in 
the case of multiple throws): 
(a) One time. (b) Two times. (c) Three times. 
Answer: A die has 6 faces which we symbolize with 1, 2, 3, 4, 5, 6. 
(a) S = {1, 2, 3, 4, 5, 6}. 
(b) If “14” for instance means getting “1” in the first throw and getting “4” in the second throw then: 
S = {11, 12, 13, 14, 15, 16, 21, 22, 23, 24, 25, 26, 31, 32, 33, 34, 35, 36, 
41, 42, 43, 44, 45, 46, 51, 52, 53, 54, 55, 56, 61, 62, 63, 64, 65, 66} 
We may also express this more compactly as: $S = \{ab \text{ where } a, b \in \{1,2,3,4,5,6\}\}$. 
(c) $S = \{abc \text{ where } a, b, c \in \{1,2,3,4,5,6\}\}$. 

PE: Describe how you get the sample space for throwing a die 3 times from the sample space for 
throwing a die 2 times (assuming listing the individual sample points, rather than characterizing them 
in a compact form as we did in part c of this Problem). 

[65] For simplicity and clarity we use “the integral of the initial probabilities over that interval” which is loose. 

10. Give examples of experiments whose sample spaces are given by: 
(a) S = {5, 10, 15, ...}. (b) $S = \{a : \pi < a < 4\pi\}$. (c) S = {red, green, blue}. 
Answer: 

(a) Throwing 2 dice simultaneously until a sum of 12 appears in attempt 5 or 10 or 15 (or a multiple 
of 5). 

(b) Plotting a circle of radius r (1 <r < 2) randomly and calculating its area a. 

(c) Drawing a ball from a bag containing 5 red, 9 green and 3 blue balls. 

PE: Repeat the Problem for the following sample spaces: 

(a) S = {2, 3, 5, 7, 11, ...}. (b) $S = \{t : 0 < t < \infty\}$. (c) S = {dog, cat, rabbit}. 

11. List a number of classifications for the sample space. 

Answer: For example: 

• Sample spaces may be classified as uniform and non-uniform (as explained already). 

• Sample spaces may be classified as discrete and continuous (as explained already). 

• We may also classify sample spaces as simple (corresponding to simple experiments) and composite 
(corresponding to composite experiments). For example, the sample space associated with an 
experiment in which a coin is tossed once is simple because the experiment is simple, while the sample space 
associated with an experiment in which a coin is tossed twice (or an experiment in which a die and 
a coin are tossed) is composite because the experiment is composite since it is a combination of a 
coin-tossing experiment with another coin-tossing experiment (or it is a combination of a die-tossing 
experiment with a coin-tossing experiment). 

Note: composite sample space (of a given composite experiment) is usually obtained by the Cartesian 
product (see Problem 22 of § 2.1) of the sample spaces of the simple experiments that form the given 
composite experiment.[66] 
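
The construction of a composite sample space by the Cartesian product can be sketched by the following code. This is only a minimal illustration which assumes that each sample point is labeled by a string and that a composite point is labeled by joining the labels of its simple points; the names `SampleSpace` and `compose` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

using SampleSpace = std::vector<std::string>;   // labels of the sample points

// Cartesian product: every point of `first` is combined with every point of `second`.
SampleSpace compose(const SampleSpace& first, const SampleSpace& second) {
    assert(!first.empty() && !second.empty());
    SampleSpace product;
    product.reserve(first.size() * second.size());
    for (const std::string& x : first)
        for (const std::string& y : second)
            product.push_back(x + y);
    return product;
}
```

For example, composing the sample space of a die {1,2,3,4,5,6} with the sample space of a coin {H,T} should produce 6 × 2 = 12 sample points (i.e. 1H, 1T, 2H, ... , 6T), and composing the sample space of a coin with itself should produce {HH, HT, TH, TT}.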

PE: Discuss the sample spaces of Problem 9 in the light of the classifications given in the present 
Problem. 

12. Identify all the events associated with tossing a coin 1 time.

Answer: The sample space in this case is $S = \{H, T\}$. So, the events are all the subsets of S, that is:
$\{\}$, $\{H\}$, $\{T\}$, $\{H, T\}$.
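
The listing of events as subsets can be automated. The following is a minimal sketch (the function name `allEvents` and the string representation of events are our own choices for illustration) which generates every subset of a small sample space by reading the bits of a counter as membership flags:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return every event (i.e. every subset) of the given sample space as a
// printable string such as "{H,T}". Bit i of `mask` decides whether the
// i-th outcome belongs to the subset, so n outcomes give 2^n events.
std::vector<std::string> allEvents(const std::vector<std::string>& sampleSpace) {
    std::vector<std::string> events;
    const std::size_t n = sampleSpace.size();
    for (std::size_t mask = 0; mask < (std::size_t{1} << n); ++mask) {
        std::string event = "{";
        bool first = true;
        for (std::size_t i = 0; i < n; ++i) {
            if (mask & (std::size_t{1} << i)) {
                if (!first) event += ",";
                event += sampleSpace[i];
                first = false;
            }
        }
        event += "}";
        events.push_back(event);
    }
    return events;
}
```

For the sample space of a single toss this produces the 4 events listed in the answer, while for the 4 outcomes of two tosses it produces $2^4 = 16$ events.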

PE: Identify 8 events associated with tossing a coin 2 times. 

13. Justify the equation of conditional probability (i.e. Eq. 45) by explaining its rationale.

Answer: It is obvious that in conditional probability the sample space is reduced by the condition, 
i.e. the occurrence of A. In more explicit terms, by the occurrence of A we have a new sample space 
which is A. This should explain why we take the intersection of B with A and divide its probability 
by P(A), since the conditional probability of B is the probability of the part of B that belongs to A 
divided by the probability of A which is the new sample space. So, what we actually do by dividing 
the probability of the part of B that belongs to A by P(A) is to normalize this probability (i.e. the 
probability of the part of B that belongs to A). 

PE: The multiplication law for associated events (i.e. Eq. 46) is a transformed form of the rule of 
conditional probability (i.e. Eq. 45). Explain the rationale of the multiplication law (in a similar or 
different way to the explanation of the conditional probability which is given in the answer of this 
Problem). Also try to find a similarity between the multiplication law and another law or principle 
which we met in counting (see § 2.2). 

14. Show the following:

(a) $P(\bar{A}) = 1 - P(A)$.

(b) $P(A|B) = P(B|A)$ iff $P(A) = P(B)$.

(c) If B is independent of A then A is independent of B.

(d) A and B are independent iff $P(A \cap B) = P(A)\,P(B)$.

(e) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, i.e. the addition law of probability (see Eq. 43).

(f) $P(A \cap B) = P(A)\,P(B|A)$, i.e. the multiplication law of probability (see Eq. 46).

(g) If $A_i$ ($i = 1, 2, \ldots$) are disjoint and exhaustive events then $\sum_i P(A_i) = 1$.


Answer:

(a) As the occurrence of an outcome (in the sample space) is certain, we should have $P(A) + P(\bar{A}) = 1$
(noting that complementary events are mutually exclusive; see § 3.1 and Eq. 44), and this leads to
$P(\bar{A}) = 1 - P(A)$. In fact, $P(A) + P(\bar{A}) = 1$ is a direct result of axiom 2 and axiom 3 (see § 3.1) noting
that complementary events are mutually exclusive (and hence they follow axiom 3) and exhaustive
(and hence they follow axiom 2).

(b) This can be seen from Eq. 47 [i.e. $P(A)\,P(B|A) = P(B)\,P(A|B)$] because if $P(A|B) = P(B|A)$
then they can be canceled from both sides (assuming neither is zero) and we get $P(A) = P(B)$.
Similarly, if $P(A) = P(B)$ then they can be canceled from both sides (assuming neither is zero) and
we get $P(A|B) = P(B|A)$.

[66] In fact, there are many details about this issue (none of which is within our scope).

(c) If B is independent of A then we have:

$$P(B|A) = P(B) \qquad \text{(Eq. 49)}$$
$$P(A)\,P(B|A) = P(A)\,P(B)$$
$$P(B)\,P(A|B) = P(A)\,P(B) \qquad \text{(Eq. 47)}$$
$$P(A|B) = P(A) \qquad [P(B) \neq 0]$$

i.e. A is independent of B.

(d) If A and B are independent then:

$$P(A)\,P(B) = P(A)\,P(B|A) \qquad \text{(B is independent of A, Eq. 49)}$$
$$= P(A \cap B) \qquad \text{(Eq. 47)}$$

On the other hand, if $P(A \cap B) = P(A)\,P(B)$ then:

$$P(A \cap B) = P(A)\,P(B) \qquad \text{(given)}$$
$$P(A|B)\,P(B) = P(A)\,P(B) \qquad \text{(Eq. 47)}$$
$$P(A|B) = P(A) \qquad [P(B) \neq 0]$$

i.e. A is independent of B (see Eq. 49). This similarly applies to B (or we use the result of part c),
and hence we conclude that if $P(A \cap B) = P(A)\,P(B)$ then A and B are independent.

(e) We have $A = (A \cap B) \cup (A - B)$ and $B = (A \cap B) \cup (B - A)$ (see part e of Problem 20 of § 2.1).
Now, if we note that $(A \cap B) \cap (A - B) = \emptyset$ and $(A \cap B) \cap (B - A) = \emptyset$ (see part d of Problem 20 of
§ 2.1), then by axiom 3 of probability (see § 3.1) we have:

$$P(A) = P[(A \cap B) \cup (A - B)] = P(A \cap B) + P(A - B)$$
and
$$P(B) = P[(A \cap B) \cup (B - A)] = P(A \cap B) + P(B - A)$$

On adding these equations side by side we get:

$$P(A) + P(B) = P(A \cap B) + P(A - B) + P(A \cap B) + P(B - A)$$
$$P(A) + P(B) - P(A \cap B) = P(A - B) + P(A \cap B) + P(B - A)$$
$$P(A) + P(B) - P(A \cap B) = P[(A - B) \cup (A \cap B) \cup (B - A)]$$
$$P(A) + P(B) - P(A \cap B) = P(A \cup B)$$

where in line 3 we used axiom 3 of probability (noting that $A - B$, $A \cap B$ and $B - A$ are disjoint; see
footnote [24] on page 22), and in line 4 we used the result of part (f) of Problem 20 of § 2.1.

(f) In Problem 13 we rationalized the equation of conditional probability (i.e. Eq. 45), and Eq. 46 is 
no more than a transformed form of Eq. 45. This should be enough as a proof (or justification) for 
Eq. 46. 

(g) We have:

$$\sum_i P(A_i) = P\left(\bigcup_i A_i\right) = P(S) = 1$$

where equality 1 is because $A_i$ are disjoint (see axiom 3 in § 3.1), equality 2 is because $A_i$ are exhaustive,
and equality 3 is because of axiom 2.

PE: Show the following by using the axioms of probability theory (see § 3.1) when convenient: 

(a) If $A \subseteq B$ then $P(A) \le P(B)$ (where A and B are events).

(b) $P(\emptyset) = 0$ (where $\emptyset$ represents the empty event).

15. If a single card is drawn randomly from a standard deck of cards, what is the probability that it is a
diamond?

Answer: In a standard deck of cards we have 52 cards all of which are equally likely to be drawn (as 
indicated by “randomly”). Moreover, the possible outcomes are exhaustive and disjoint. Hence, we can 
use Eq. 38 (noting that there are 13 diamonds in the deck), that is: 


$$P(\text{diamond}) = \frac{\text{number of diamonds}}{\text{number of cards}} = \frac{13}{52} = \frac{1}{4}$$


PE: What is the probability that the drawn card is a non-heart?

16. A 3-digit number (i.e. a number between 100 and 999) is chosen randomly. What is the probability
that the sum of its digits is 3?

Answer: We have 900 3-digit numbers. Out of these 900 numbers, we have 6 numbers that satisfy 
this condition, i.e. 111, 102, 120, 201, 210, and 300. Hence, from Eq. 38 we get: 

$$P(\text{sum of digits} = 3) = \frac{6}{900} = \frac{1}{150}$$
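
The count of 6 favorable numbers can be checked by brute force. The following is a minimal sketch (the function names `digitSum` and `countThreeDigitWithSum` are hypothetical and used only for illustration) which loops over all the 900 3-digit numbers:

```cpp
// Sum of the decimal digits of a non-negative integer.
int digitSum(int number) {
    int sum = 0;
    while (number > 0) {
        sum += number % 10;
        number /= 10;
    }
    return sum;
}

// Count the 3-digit numbers (100 to 999) whose digits add up to `target`.
int countThreeDigitWithSum(int target) {
    int count = 0;
    for (int number = 100; number <= 999; ++number) {
        if (digitSum(number) == target) ++count;
    }
    return count;
}
```

Dividing `countThreeDigitWithSum(3)` by 900 reproduces the probability 1/150, and the same function can be used for other digit sums.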


PE: Repeat the Problem for a sum of 4 (instead of 3). 

17. A die is thrown randomly in a large number (say 6000) of identical trials.[67] The results were as
follows:

face 1 is obtained in 995 trials, face 2 is obtained in 1004 trials, face 3 is obtained in 673 trials, face 4
is obtained in 994 trials, face 5 is obtained in 997 trials, and face 6 is obtained in 1337 trials. Estimate
the probabilities of obtaining 1, 2, 3, 4, 5, 6 for this die. Is this die fair?

Answer: If $P_1$ is the probability of obtaining face 1 (and the rest are similarly defined) then we have:

$$P_1 \simeq \frac{995}{6000} \simeq \frac{1}{6} \qquad P_2 \simeq \frac{1004}{6000} \simeq \frac{1}{6} \qquad P_3 \simeq \frac{673}{6000} \simeq \frac{1}{9}$$
$$P_4 \simeq \frac{994}{6000} \simeq \frac{1}{6} \qquad P_5 \simeq \frac{997}{6000} \simeq \frac{1}{6} \qquad P_6 \simeq \frac{1337}{6000} \simeq \frac{2}{9}$$


It is obvious that this die is not fair because (some of) the probabilities of the different faces are not 
equal. Also see § 1.2. 
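
The statistical estimation used in this answer amounts to dividing each count by the total number of trials. The following is a minimal sketch of this calculation. We note that the fairness check `looksFair` (with its `tolerance` parameter) is a crude criterion which we assume here for illustration only; it is not a proper statistical test.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Relative-frequency estimate of the probability of each face: count / total.
std::array<double, 6> estimateProbabilities(const std::array<int, 6>& counts) {
    int total = 0;
    for (int c : counts) total += c;
    std::array<double, 6> p{};
    for (std::size_t i = 0; i < 6; ++i) {
        p[i] = static_cast<double>(counts[i]) / total;
    }
    return p;
}

// Crude check (an assumption of this sketch, not a statistical test):
// the die "looks fair" if every estimate is within `tolerance` of 1/6.
bool looksFair(const std::array<double, 6>& p, double tolerance) {
    for (double x : p) {
        if (std::fabs(x - 1.0 / 6.0) > tolerance) return false;
    }
    return true;
}
```

With the counts of this Problem and a tolerance of 0.01 the check fails because of faces 3 and 6, in agreement with the conclusion that this die is not fair.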

PE: How do you deal with such situations (i.e. estimating the probabilities statistically as done above) 
if the variable is continuous (such as distance or time) rather than discrete (as in the above example of 
throwing a die)? Would you suggest, for instance, dividing the continuous variable to small intervals 
and binning the outcomes (according to their values) in these intervals? 

18. Give a simple example of an impossible event, a certain event and a probable event.

Answer: If nine cards are numbered from 1 to 9 and one of these cards is drawn at random then 
getting a number larger than 9 is an impossible event, and getting a number between 1 and 9 (inclusive) 
is a certain event, while getting an odd number is a probable event. 

PE: Do the following: 

(a) Repeat the Problem for throwing a die 1 time. 

(b) Give another example for impossible, certain and probable events from the nine cards example 
given in the answer. 


[67] “Identical trials” should imply that the physical conditions of the die did not change during (and possibly by) these
repetitive trials. It should also imply that the ambient conditions (including the individual, or machine, who throws the
die and his physical and mental conditions) remained the same during these trials.

19. Give some simple examples of complementary events.

Answer: Examples of complementary events are: 

• “Getting odd number” and “getting even number” in a die-throwing experiment.

• “Getting head” and “getting tail” in a coin-tossing experiment.

• “Passing the test” and “failing the test” (assuming they are the only possible outcomes of an exam).

PE: Give some examples of complementary events in science (e.g. physics) related to probability.

20. Give simple examples of mutually exclusive and non-“mutually exclusive” events.

Answer: In a die-throwing experiment, “getting odd number” and “getting even number” are mutually 
exclusive, while “getting odd number” and “getting number > 3” are not mutually exclusive. 

PE: Give two more examples of mutually exclusive and non-“mutually exclusive” events from a die- 
throwing experiment. 

21. Make a clear distinction between complementary events and mutually exclusive events.

Answer: Complementary events are necessarily mutually exclusive, but mutually exclusive events are
not necessarily complementary. This is because complementary events are those mutually exclusive
events that make up the entire sample space. For example, “getting 1” and “getting 3” in a die-throwing
experiment are mutually exclusive (because they cannot occur simultaneously) but they are not
complementary (because they do not represent the entire sample space). In contrast, “getting 1” and
“not getting 1” are complementary (as well as mutually exclusive). Also, “getting 1”, “getting 2”, ... , “getting 6”
are complementary in the extended sense that they are mutually exclusive and together make up the
entire sample space.

PE: Classify the following events as complementary or/and mutually exclusive or not (considering a 
trial in which we draw a ball from an urn containing 10 red, 6 black and 7 green balls): 

(a) “Getting green ball” and “not getting black ball”. 

(b) “Getting black ball” and “getting red or green ball”. 

(c) “Getting red ball” and “getting black ball”. 

22. The medical staff of a small clinic consists of two male doctors and one female doctor as well as four
female nurses and two male nurses. If a member of medical staff is picked randomly, symbolize the 
following probabilities: being a male doctor, being a female nurse, being a male or nurse, and being a 
doctor or female. 

Answer: If D, N, M, F stand for doctor, nurse, male, female then these probabilities should be sym- 
bolized as follows: 


$$P(M \cap D) \qquad P(F \cap N) \qquad P(M \cup N) \qquad P(D \cup F)$$


PE: Repeat the Problem for the following probabilities: being neither male nor nurse, being a female 
but not doctor, being a male nurse, and being a doctor or nurse. Try to correlate these probabilities 
(or some of them) to the above probabilities, i.e. those given in the Problem. 

23. Calculate the probabilities of Problem 22.

Answer: Due to the simplicity of this Problem we solve it by simple count[68] instead of using formulae
(noting that the formulae will be used for verifying these probabilities in Problem 25).

• We have two male doctors (out of nine staff) and hence $P(M \cap D) = 2/9$.

• We have four female nurses (out of nine staff) and hence $P(F \cap N) = 4/9$.

• We have eight members who are male or/and nurse and hence $P(M \cup N) = 8/9$. We may also say:
we have only one female doctor (i.e. neither male nor nurse) and hence $P(M \cup N) = (9 - 1)/9 = 8/9$.

• We have seven members who are doctor or/and female and hence $P(D \cup F) = 7/9$. We may also say:
we have only two male nurses (i.e. neither female nor doctor) and hence $P(D \cup F) = (9 - 2)/9 = 7/9$.
PE: Calculate the probabilities of the PE of Problem 22. 

24. Which of the events of the 4 pairs of events in Problem 22 are independent and which are not?

Answer: The probabilities of D, N, M, F are:

$$P(D) = \frac{3}{9} \qquad P(N) = \frac{6}{9} \qquad P(M) = \frac{4}{9} \qquad P(F) = \frac{5}{9}$$

To test if the events in the pair M, D (for instance) are independent or not we use the condition
of Eq. 50, i.e. they are independent if $P(M \cap D) = P(M)\,P(D)$ and they are not independent if
$P(M \cap D) \neq P(M)\,P(D)$ [see Eq. 50 and part (d) of Problem 14]. This test similarly applies to the
other pairs. Accordingly:[69]

$$P(M)\,P(D) = \frac{4}{9} \times \frac{3}{9} = \frac{4}{27} \neq \frac{2}{9} = P(M \cap D)$$
$$P(F)\,P(N) = \frac{5}{9} \times \frac{6}{9} = \frac{10}{27} \neq \frac{4}{9} = P(F \cap N)$$
$$P(M)\,P(N) = \frac{4}{9} \times \frac{6}{9} = \frac{8}{27} \neq \frac{2}{9} = P(M \cap N)$$
$$P(D)\,P(F) = \frac{3}{9} \times \frac{5}{9} = \frac{5}{27} \neq \frac{1}{9} = P(D \cap F)$$

Therefore, we conclude that the events in none of these pairs are independent.

[68] We note that simple count is a basic and intuitive method for calculating probabilities and is based on the basic
definition of probability (see Eq. 38). However, it is limited in applicability to very simple problems and limited in
validity to the cases of equal initial probabilities.
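
The simple count of Problem 23 and the independence test of this Problem can be reproduced by enumerating the nine staff members. The following is a minimal sketch; the type `Member` and the functions `clinicStaff`, `probability` and `independent` are hypothetical names which we introduce for illustration, and the comparison uses a small rounding tolerance.

```cpp
#include <cmath>
#include <functional>
#include <vector>

struct Member {
    bool male;    // false means female
    bool doctor;  // false means nurse
};

// The nine staff members of Problem 22.
std::vector<Member> clinicStaff() {
    std::vector<Member> staff;
    for (int i = 0; i < 2; ++i) staff.push_back({true, true});    // male doctors
    staff.push_back({false, true});                               // female doctor
    for (int i = 0; i < 4; ++i) staff.push_back({false, false});  // female nurses
    for (int i = 0; i < 2; ++i) staff.push_back({true, false});   // male nurses
    return staff;
}

using Event = std::function<bool(const Member&)>;

// Probability of an event by simple count (all members are equally likely).
double probability(const std::vector<Member>& staff, const Event& event) {
    int favorable = 0;
    for (const Member& m : staff) {
        if (event(m)) ++favorable;
    }
    return static_cast<double>(favorable) / staff.size();
}

// Independence test of Eq. 50: P(A and B) equals P(A) P(B), up to rounding.
bool independent(const std::vector<Member>& staff, const Event& a, const Event& b) {
    const double joint =
        probability(staff, [&](const Member& m) { return a(m) && b(m); });
    return std::fabs(joint - probability(staff, a) * probability(staff, b)) < 1e-12;
}
```

Calling `independent` on the pairs (M, D), (F, N), (M, N) and (D, F) returns false in all four cases, in agreement with the conclusion of this answer.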

PE: Repeat the Problem for the 4 pairs of events in the PE of Problem 22. 

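25. Use the multiplication and addition laws of probability to verify the probabilities $P(M \cap D)$, $P(F \cap N)$,
$P(M \cup N)$ and $P(D \cup F)$ which we obtained in Problem 23 by simple count.

Answer: For $P(M \cap D)$ and $P(F \cap N)$ we use the multiplication law (i.e. Eq. 46), that is:

$$P(M \cap D) = P(M)\,P(D|M) = \frac{4}{9} \times \frac{2}{4} = \frac{2}{9}$$
$$P(F \cap N) = P(F)\,P(N|F) = \frac{5}{9} \times \frac{4}{5} = \frac{4}{9}$$

For $P(M \cup N)$ and $P(D \cup F)$ we use the addition law (i.e. Eq. 43), that is:

$$P(M \cup N) = P(M) + P(N) - P(M \cap N) = \frac{4}{9} + \frac{6}{9} - \frac{2}{9} = \frac{8}{9}$$
$$P(D \cup F) = P(D) + P(F) - P(D \cap F) = \frac{3}{9} + \frac{5}{9} - \frac{1}{9} = \frac{7}{9}$$

These results are identical to the results of Problem 23.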
PE: Repeat the Problem for the 4 pairs of events in the PE of Problem 23.

26. Give some examples of disjoint events related to the trial of Problem 22.

Answer: It is obvious that the events M, F are disjoint because no member of staff can be male and
female. Also, the events D, N are disjoint because no member of staff can be doctor and nurse. All
other pairs are not disjoint.[70]

PE: Referring to Problem 22, give examples of events which are:
(a) Non-disjoint. (b) Non-complementary. (c) Disjoint but non-complementary.

27. Referring to Problem 22, if the member of staff that we picked randomly was a male, find the probability
of (a) being a nurse and (b) being a doctor.

Answer:
(a) We use the rule of conditional probability (i.e. Eq. 45), that is (see Problem 24 for the employed
probabilities):

$$P(N|M) = \frac{P(N \cap M)}{P(M)} = \frac{2/9}{4/9} = \frac{1}{2}$$

This is reasonable because we have four males in the staff, two of whom are nurses, and hence if the picked
person is male then the probability of being a nurse should be 2/4 = 1/2.


(b) We can use the rule of conditional probability (as we did in part a) but if we note that N and
D are “complementary” (i.e. in the new sample space which is the sample space made of males) then
we can immediately conclude that $P(D|M) = 1 - P(N|M) = 1/2$. This should also be reasonable
because we have four males, two of whom are doctors, and hence if the picked person is male then the
probability of being a doctor should be 2/4 = 1/2.

[69] We note that $P(M \cap D) = 2/9$ and $P(F \cap N) = 4/9$ (see Problem 23). Also, by using the simple count method (which
we used in Problem 23) we have $P(M \cap N) = 2/9$ and $P(D \cap F) = 1/9$.

[70] In fact, we have six pairs which we may label compactly as: DN, DM, DF, NM, NF, MF. As we see, only the pair DN
and the pair MF are disjoint. However, all this is about pairs; otherwise we have other disjoint events such as being
male doctor and being female doctor (or female nurse).

PE: Referring to Problem 22, if the chosen member of staff was a nurse, find the probability of being 
a female. 

28. Referring to Problem 22, explain the meaning of: $P[(N \cap \bar{M}) \cup (M \cap \bar{N})]$

Answer: It means the probability of picking a female nurse or a male doctor.
PE: Referring to Problem 22, what is the meaning of: 
(a) P[(DaF)u(FnD)]. (b) P[(NNF)U (FAN). (c) P[(FnD)U(MnN)). 

29. Determine the requirement that a pair of events (say A and B) should satisfy to be independent AND
disjoint.

Answer: If A and B are independent then $P(A \cap B) = P(A)\,P(B)$, and if A and B are disjoint then
$P(A \cap B) = 0$. On combining these conditions we obtain $P(A)\,P(B) = 0$, i.e. A and B are independent
AND disjoint if $P(A) = 0$ or $P(B) = 0$ (i.e. one of them is impossible). In fact, this should indicate
that disjoint events are highly dependent (since the occurrence of one means the negation of the
other). For instance, if A and B are two disjoint events where $P(A) = 1/2$ and $P(B) = 1/3$ then
$P(A|B) = 0 \neq P(A) = 1/2$ because $P(A \cap B) = 0$. Similarly, $P(B|A) = 0 \neq P(B) = 1/3$ because
$P(A \cap B) = 0$. As we see, $P(A|B) \neq P(A)$ and $P(B|A) \neq P(B)$ which means that these events are
dependent. So in brief, (non-trivial) disjoint events must be dependent.

Note: the dependence of disjoint events can be shown in other ways. For example, if we use the
method of Problem 24 for testing the dependence of events then we can write (assuming neither A nor
B is impossible):

$$P(A \cap B) = 0 \neq P(A)\,P(B)$$

i.e. if A and B are disjoint [as required by $P(A \cap B) = 0$] then they cannot be independent [because
$P(A)\,P(B) \neq 0$] noting that if A and B are independent then we must have $P(A \cap B) = P(A)\,P(B)$.

PE: Let A and B be two dependent events. List all the possibilities for the relationship between them.

30. Two pregnant women gave birth to two babies (i.e. each woman gave birth to a single child). What is
the probability that at least one of the babies is a boy (assuming that the probabilities of having boy
and having girl are equal and supposing that “boy” and “girl” are the only possibilities)?

Answer: We have 4 equally likely possibilities: BB, BG, GB, and GG (where B stands for boy and G for
girl and noting that the order in BG and GB does matter since the position of B and G corresponds
to a specific woman). As we see, in 3 of these 4 possibilities we have at least one B and hence the
probability that at least one of the babies is a boy is 3/4. We can also argue that “at least one of
the babies is a boy” and “both babies are girls” are complementary and since the probability of “both
babies are girls” is obviously 1/4 then the probability of “at least one of the babies is a boy” should be
$1 - (1/4) = 3/4$ (see Eq. 40 and part a of Problem 14).

PE: If one woman in this Problem gave birth to twins (i.e. two babies) while the other gave birth
to a single baby, what is the probability that all the babies are girls (assuming equal probabilities and
dual possibilities for the gender of babies)?

31. The target in a firing test is a square board of area 0.25 m² inside which a circle of radius 0.1 m is
drawn. To pass the test, the bullet should hit the interior of the circle. Assuming that hitting the
square is certain (or the square is hit already) and the probability of hitting any part of it is the same,
what is the probability of passing the test?

Answer: The probability of passing the test should equal the ratio of the area of the circle to the area
of the square. The area of the circle is $\pi \times 0.1^2 = 0.01\pi$ m². Hence, the probability of passing the test
is $0.01\pi/0.25 = \pi/25 \simeq 0.126$.
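
The area-ratio argument can be tested by simulation. The following is a minimal Monte Carlo sketch in which we assume (for convenience only, since the position of the circle inside the square does not affect the ratio) that the circle is centered in the square. The function name `estimateHitProbability` and its parameters are hypothetical.

```cpp
#include <random>

// Estimate the probability of hitting a circle of radius `radius` centered in
// a square of side `side`, assuming that hits are uniformly distributed over
// the square. `seed` makes the pseudo-random sequence repeatable.
double estimateHitProbability(double side, double radius, int shots, unsigned seed) {
    std::mt19937 generator(seed);
    std::uniform_real_distribution<double> coordinate(-side / 2.0, side / 2.0);
    int hits = 0;
    for (int i = 0; i < shots; ++i) {
        const double x = coordinate(generator);
        const double y = coordinate(generator);
        if (x * x + y * y < radius * radius) ++hits;  // inside the circle
    }
    return static_cast<double>(hits) / shots;
}
```

For a square of side 0.5 m (i.e. area 0.25 m²) and a circle of radius 0.1 m, the estimate should approach $\pi/25$ as the number of shots increases.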

PE: Repeat the Problem but replace the square by a circle of radius 0.25 m.

32. An integer number between 5 and 50 (inclusive) is chosen randomly. What is the probability of “being
prime AND multiple of 3”?

Answer: The event of “being prime AND multiple of 3” is impossible (the only prime which is a multiple
of 3 is 3 itself, and 3 is outside the given range) and hence the probability of this is 0.

PE: Repeat the Problem for the probability of “being prime OR multiple of 3”. 

33. An integer number between 1 and 50 (inclusive) is chosen randomly. What is the probability of “being
divisible by 2 OR divisible by 5”?

Answer: Let $d_2$ and $d_5$ mean divisible by 2 and divisible by 5 respectively. We have 25 numbers
divisible by 2 (i.e. the even numbers 2, 4, ..., 50) and hence $P(d_2) = 25/50 = 1/2$. We have 10
numbers divisible by 5 (i.e. the multiples of five 5, 10, ..., 50) and hence $P(d_5) = 10/50 = 1/5$. We
also have 5 numbers divisible by 2 AND divisible by 5 (i.e. the multiples of ten 10, 20, ..., 50) and
hence $P(d_2 \cap d_5) = 5/50 = 1/10$. On using the addition law of probability (see Eq. 43) we get:

$$P(d_2 \cup d_5) = P(d_2) + P(d_5) - P(d_2 \cap d_5) = \frac{1}{2} + \frac{1}{5} - \frac{1}{10} = \frac{6}{10} = \frac{3}{5}$$
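
The counts used in this answer can be verified by direct enumeration. The following is a minimal sketch in which `countIf` is a hypothetical helper (not a library function) that counts the integers of a range satisfying a given condition:

```cpp
// Count the integers k with low <= k <= high for which predicate(k) is true.
template <typename Predicate>
int countIf(int low, int high, Predicate predicate) {
    int count = 0;
    for (int k = low; k <= high; ++k) {
        if (predicate(k)) ++count;
    }
    return count;
}
```

For example, `countIf(1, 50, [](int k) { return k % 2 == 0 || k % 5 == 0; })` counts the numbers which are divisible by 2 OR divisible by 5, and this count divided by 50 should agree with the result of the addition law. The same helper can be used (with suitable conditions) for the parts of the following PE.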
PE: Repeat the Problem for the probability of being: 
(a) Prime. (b) Perfect square. (c) Prime OR perfect square. 
(d) Prime AND perfect square. (e) Perfect square AND even. (f) Perfect square AND odd. 
(g) Prime AND odd. (h) Prime OR odd. (i) Prime AND even. 


34. A die and a coin are thrown simultaneously. What is the probability of getting T (on the coin) and 2
(on the die)?

Answer: It is obvious that the outcomes of the die and the coin are independent. Hence, we can use
the multiplication law for independent events [noting that $P(T) = 1/2$ and $P(2) = 1/6$ assuming they
are fair], that is:

$$P(T \cap 2) = P(T)\,P(2) = \frac{1}{2} \times \frac{1}{6} = \frac{1}{12}$$

We may also use simple count, i.e. we have 12 equally likely outcomes which are:

H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6

only one of which (i.e. T2) meets the requirement and hence $P(T \cap 2) = 1/12$.

PE: Repeat the Problem for the probability of getting: 

(a) H (on the coin) and even number (on the die). 

(b) “H AND odd number” OR “T AND number less than 4”. 

35. An electronic device requires a transistor whose chance of being defective is 3% and a resistor whose
chance of being defective is 1%. The device works only if both these components are perfect. Assuming
that all the other components in the device are perfect and the defects of the transistor and resistor
are independent, what is the probability of the device being defective?

Answer: If T and R stand for perfect transistor and perfect resistor respectively, then $P(T) = 0.97$ and
$P(R) = 0.99$, and hence by the multiplication law for independent events (see Eq. 50) the probability
of the device being perfect is:

$$P(T \cap R) = P(T)\,P(R) = 0.97 \times 0.99 = 0.9603$$

Accordingly, the probability of the device being defective is:

$$P(\overline{T \cap R}) = 1 - P(T \cap R) = 1 - 0.9603 = 0.0397$$

where the bar means “defective” (and noting that “perfect” and “defective” are complementary; see Eq.
40 and part a of Problem 14).
We may also use the addition law of probability (see Eq. 43 as well as Eq. 21), that is:

$$P(\overline{T \cap R}) = P(\bar{T} \cup \bar{R}) = P(\bar{T}) + P(\bar{R}) - P(\bar{T} \cap \bar{R}) = 0.03 + 0.01 - (0.03 \times 0.01) = 0.0397$$


PE: Repeat the Problem assuming this time that P(T) = 0.02 and P(R) = 0.04.

36. Referring to Problem 27 of § 2.2, what is the probability of any specific distribution (where “specific
distribution” means for instance balls 2, 5, 6 are in the 3-ball bag, balls 1, 4, 8, 11, 12 are in the 5-ball
bag and the other balls are in the 8-ball bag)?
Answer: According to the result of Problem 27 of § 2.2 we have 720720 possibilities and hence the 
probability of any specific distribution (i.e. the probability of any one of these possibilities) is 1/720720 
(noting that all these possibilities are presumably equally likely). 
PE: Repeat the Problem for the PE of Problem 27 of § 2.2.

37. Referring to Problem 32 of § 2.2:
(a) What is the probability that the last 2 drawers remain vacant? 
(b) What is the probability that at least one of the last 2 drawers is filled? 
Answer: We assume that all ways of storing are equally likely. 
(a) The event “the last 2 drawers remain vacant” is equivalent to the event “the shirts are stored in the 
first 5 drawers”. Now, we have 5! possibilities for filling the first 5 drawers (i.e. the first drawer can 
store any one of the 5 shirts, the second drawer can store any one of the remaining 4 shirts, and so 
on), which means that we have 5! possibilities for the last 2 drawers being vacant. Also, from Problem 
32 of § 2.2 we know that the total number of possibilities for storing these shirts is 2520. Hence, the 
probability that the last 2 drawers remain vacant is 5!/2520 ~ 0.0476. 
We may also argue that “the last 2 drawers remain vacant” is just one possibility of the $C^{7}_{5}$ (equally
likely) possibilities of selecting 5 drawers (out of 7) for storage and hence the probability that the last
2 drawers remain vacant is $1/C^{7}_{5} = 1/21 \simeq 0.0476$.
(b) The event “the last 2 drawers remain vacant” and the event “at least one of the last 2 drawers is 
filled” are complementary. Hence, the probability that at least one of the last 2 drawers is filled is 
1 — (5!/2520) ~ 0.9524. 
PE: Referring to the PE of Problem 32 of § 2.2, what is the probability that the second, fourth and
ninth garages remain vacant?

38. Referring to Problem 37 of § 2.2, a byte is selected randomly. What is the probability that the sum of
its bits is 3?

Answer: A byte contains 8 bits each of which can be 0 or 1 and hence by the fundamental principle
of counting there are $2^8 = 256$ different bytes. Also, from the method of Problem 37 of § 2.2 we know
that we have $C^{8}_{3} = 56$ different bytes that contain exactly three 1’s (and hence the sum of their bits is
3). Therefore, the required probability is $56/256 \simeq 0.2188$.
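
The number of bytes with a given bit sum can be checked by enumerating all the 256 bytes. The following is a minimal sketch (the function name `countBytesWithBitSum` is hypothetical) which uses the standard `std::bitset` to count the 1’s in each byte:

```cpp
#include <bitset>

// Count the bytes (values 0 to 255) whose bits add up to `target`,
// i.e. the bytes that contain exactly `target` ones.
int countBytesWithBitSum(int target) {
    int count = 0;
    for (unsigned value = 0; value < 256; ++value) {
        if (static_cast<int>(std::bitset<8>(value).count()) == target) ++count;
    }
    return count;
}
```

Dividing `countBytesWithBitSum(3)` by 256 reproduces the probability of this answer, and summing the function over several targets handles conditions like the one in the following PE.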
PE: Repeat the Problem to find the probability that the sum is greater than 5.

39. Show that if we distribute n balls in n bags randomly (where there is no limit on the capacity or storage
of bags), then for relatively large n the probability of each bag containing 1 ball is approximately
$e^{-n}\sqrt{2\pi n}$.

Answer: Referring to part (a) of Problem 34 of § 2.2, balls are like distinguishable particles and
bags are like states (with no limit on their occupancy) and hence we can use the Maxwell-Boltzmann
statistics. Accordingly, we have $n^n$ possibilities for the distribution of n balls in n bags. Moreover, we
have $n!$ possibilities for distributing n balls in n bags such that each bag contains 1 ball (since the 1st
bag can contain any one of the n balls, the 2nd bag can contain any one of the remaining $n - 1$ balls
and so on, and hence we have $n!$ possibilities). Therefore, the probability of each bag containing 1 ball
is:

$$\frac{n!}{n^n} \simeq e^{-n}\sqrt{2\pi n} \qquad (\text{large } n)$$

where we used in making this approximation the Stirling formula for approximating factorials for large
n (see Eq. 37). In fact, the last equation gives a good approximation even for small n. To demonstrate
the accuracy of this approximation we present in Table 1 a sample of values that n can take and
compare the exact value of probability (i.e. $n!/n^n$) with its approximate value (i.e. $e^{-n}\sqrt{2\pi n}$).
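
The comparison between the exact and approximate values can be reproduced with a few lines of code. The following is a minimal sketch (the function names `exactProbability` and `stirlingProbability` are hypothetical) in which the exact value is accumulated as a product of the factors $k/n$ to avoid the overflow of $n!$ and $n^n$:

```cpp
#include <cmath>

// Exact value of n!/n^n, computed as the product of k/n for k = 1, ..., n
// so that neither n! nor n^n is formed explicitly.
double exactProbability(int n) {
    double p = 1.0;
    for (int k = 1; k <= n; ++k) {
        p *= static_cast<double>(k) / n;
    }
    return p;
}

// Approximate value e^(-n) * sqrt(2 * pi * n) obtained from the Stirling formula.
double stirlingProbability(int n) {
    const double pi = std::acos(-1.0);
    return std::exp(-static_cast<double>(n)) * std::sqrt(2.0 * pi * n);
}
```

Evaluating both functions for n = 2, 5, 10, 15, 20 should reproduce the entries of Table 1, and the relative difference between them decreases as n increases.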
PE: Build a spreadsheet in which you calculate $n!/n^n$ and $e^{-n}\sqrt{2\pi n}$ for a range of values of n (e.g.
n = 2 to n = 100) and compare the two by plotting them (or their ratio or relative difference) on a
chart.

Table 1: The table of Problem 39 of § 3.2.

$n$ | 2 | 5 | 10 | 15 | 20
$n!/n^n$ | 0.5 | 0.03840 | $3.629 \times 10^{-4}$ | $2.986 \times 10^{-6}$ | $2.320 \times 10^{-8}$
$e^{-n}\sqrt{2\pi n}$ | 0.47975 | 0.03777 | $3.599 \times 10^{-4}$ | $2.970 \times 10^{-6}$ | $2.311 \times 10^{-8}$

3.3. Extensions and Generalizations 


Most of the relationships and laws of probability involving two events (which we investigated and stated 
earlier in § 3.2) can be extended and generalized to more than two events. In this section we present a 
number of these extensions and generalizations. 

The addition law of probability for three events A, B,C is given by: 


$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) \tag{51}$$
By induction, this formula can be extended to n events, that is: 
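$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{i \neq j} P(A_i \cap A_j) + \sum_{i \neq j \neq k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n-1} P(A_1 \cap A_2 \cap \cdots \cap A_n) \tag{52}$$

where $\bigcup_{i=1}^{n} A_i$ means the union of all the $A_i$’s (i.e. $\bigcup_{i=1}^{n} A_i = A_1 \cup A_2 \cup \ldots \cup A_n$), and all indices run over
$1, 2, \ldots, n$ (with each distinct combination of events counted once in each sum).[71] If $A_i$ are pairwise exclusive events,
the last equation (i.e. Eq. 52) becomes:

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) \tag{53}$$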


The multiplication law of probability for three events A, B,C is given by: 
$$P(A \cap B \cap C) = P(A)\,P(B|A)\,P[C|(A \cap B)] \tag{54}$$


This equation can be easily extended to more than three events (by noting the pattern of Eq. 54), that 
is: 


$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2|A_1)\,P[A_3|(A_1 \cap A_2)] \cdots P[A_n|(A_1 \cap A_2 \cap \cdots \cap A_{n-1})] \tag{55}$$

The multiplication law of probability for three mutually independent events A, B, C is given by:

$$P(A \cap B \cap C) = P(A)\,P(B)\,P(C) \qquad (A, B, C \text{ are mutually independent}) \tag{56}$$

This formula can be extended to n mutually independent events, that is:

$$P\left(\bigcap_{i=1}^{n} A_i\right) = \prod_{i=1}^{n} P(A_i) \qquad (A_i \text{ are mutually independent}) \tag{57}$$

where $\bigcap_{i=1}^{n} A_i$ means the intersection of all the $A_i$’s (i.e. $\bigcap_{i=1}^{n} A_i = A_1 \cap A_2 \cap \ldots \cap A_n$).[72]

[71] We note that $(-1)^{n-1}$ in the last term of Eq. 52 is specific to the last term. The sign of each sum of terms is $(-1)^{m-1}$
where m is the number of events considered in the probabilities of that sum, i.e. 1 event for $\sum_{i=1}^{n} P(A_i)$, 2 events for
$\sum_{i \neq j} P(A_i \cap A_j)$, and so on.


[72] We note that Eqs. 56 and 57 should be seen as part of the definition of mutual independence (see Problems 1 and 2).

If event A is a union of n pairwise exclusive events $A_i$ ($i = 1, 2, \ldots, n$) and B is another event (in the same sample space)
then from Eq. 53 we have:

$$P(A \cap B) = \sum_{i=1}^{n} P(A_i \cap B) \tag{58}$$

Moreover, if A represents the entire sample space then from Eq. 58 we get:

$$P(B) = P(A \cap B) = \sum_{i=1}^{n} P(A_i \cap B) \tag{59}$$


where we used Eq. 7 in the first equality (noting that A is the universal set since it represents the entire 
sample space). 


Problems 


1. Distinguish between pairwise independence and mutual independence of a number of events (i.e. ≥ 3
events).
Answer: We describe a collection of events as pairwise independent if any two events in the 
collection are independent, and we describe the collection as mutually (or jointly) independent if 
any number of events in the collection are independent. For example, if A, B,C are three events then 
A, B,C are pairwise independent if: 


$$P(A \cap B) = P(A)\,P(B) \quad \& \quad P(A \cap C) = P(A)\,P(C) \quad \& \quad P(B \cap C) = P(B)\,P(C)$$

However, A, B, C are mutually independent if in addition to these three conditions we also have:

$$P(A \cap B \cap C) = P(A)\,P(B)\,P(C)$$


Accordingly, mutual independence is stronger than pairwise independence, i.e. if a number of events 
are mutually independent then they are pairwise independent but if they are pairwise independent 
then they are not necessarily mutually independent. 

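Note: pairwise independence of random events $A_i$ ($i = 1, 2, \ldots, n$ where n can be infinite) is expressed
mathematically by the condition:

$$P(A_i \cap A_j) = P(A_i)\,P(A_j) \qquad (i \neq j)$$

while mutual independence of random events $A_i$ is expressed mathematically by the combination of
conditions:[73]

$$P\left(\bigcap_{i \in I} A_i\right) = \prod_{i \in I} P(A_i) \qquad (\text{for every finite set } I \text{ of two or more indices})$$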


PE: A, B,C,D are four events. Write the conditions for their (a) pairwise independence and (b) 
mutual independence. 
2. Give examples for the following: 
(a) Events which are not pairwise independent (and hence they are not mutually independent). 
(b) Events which are pairwise independent but not mutually independent. 
(c) Events which are mutually independent (and hence they are pairwise independent). 


[73] To put it in plain words, this combination of conditions means: a number of events are mutually independent iff every
combination of these events (involving any number of these events) is independent.

Answer: Let A, B, C be three events.
(a) If we draw a card (randomly) from a pack of 6 cards (numbered 1, 2, 3, 4, 5, 6) where:

$$A = \{1, 2, 3\} \qquad B = \{3, 4, 5\} \qquad C = \{1, 5, 6\}$$

then we have:

$$P(A)\,P(B) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \neq \frac{1}{6} = P(A \cap B)$$
$$P(A)\,P(C) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \neq \frac{1}{6} = P(A \cap C)$$
$$P(B)\,P(C) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \neq \frac{1}{6} = P(B \cap C)$$
$$P(A)\,P(B)\,P(C) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8} \neq 0 = P(A \cap B \cap C)$$

i.e. A, B, C are neither pairwise independent nor mutually independent.
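(b) If we draw a card (randomly) from a pack of 12 cards (numbered 1, 2, ..., 12) where:

$$A = \{1, 2, 3, 4, 5, 6\} \qquad B = \{4, 5, 6, 7, 8, 9\} \qquad C = \{1, 2, 3, 7, 8, 9\}$$

then we have:

$$P(A)\,P(B) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} = P(A \cap B)$$
$$P(A)\,P(C) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} = P(A \cap C)$$
$$P(B)\,P(C) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} = P(B \cap C)$$
$$P(A)\,P(B)\,P(C) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8} \neq 0 = P(A \cap B \cap C)$$

i.e. A, B, C are pairwise independent but not mutually independent.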
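(c) If we draw a card (randomly) from a pack of 16 cards (numbered 1, 2, ..., 16) where:

$$A = \{1, 2, 3, 4, 5, 6, 7, 8\} \qquad B = \{5, 6, 7, 8, 9, 10, 11, 12\} \qquad C = \{3, 4, 7, 8, 9, 10, 13, 14\}$$

then we have:

$$P(A)\,P(B) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} = P(A \cap B)$$
$$P(A)\,P(C) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} = P(A \cap C)$$
$$P(B)\,P(C) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} = P(B \cap C)$$
$$P(A)\,P(B)\,P(C) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8} = P(A \cap B \cap C)$$

i.e. A, B, C are pairwise independent and mutually independent.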

Note: as explained above, to have mutual independence we need to have pairwise independence (as well 
as higher-multiplicity independence). In other words, higher-multiplicity independence does not imply 
or guarantee pairwise independence and hence we need to verify pairwise independence independently 
of verifying higher-multiplicity independence. For example, if A,B,C are three events in a trial of 
drawing a card (randomly) from a pack of 16 cards (numbered 1, 2,..., 16) where: 


A = {1, 2, 3, 4, 5, 6, 7, 8}    B = {7, 8, 9, 10, 11, 12, 13, 14}    C = {6, 7, 8, 9, 10, 14, 15, 16}

then we have:

$$\begin{aligned}
P(A)\,P(B) &= \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4} \neq \tfrac{1}{8} = P(A \cap B) \\
P(A)\,P(C) &= \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4} \neq \tfrac{3}{16} = P(A \cap C) \\
P(B)\,P(C) &= \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4} \neq \tfrac{5}{16} = P(B \cap C) \\
P(A)\,P(B)\,P(C) &= \tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{8} = P(A \cap B \cap C)
\end{aligned}$$

As we see, although the condition $P(A \cap B \cap C) = P(A)\,P(B)\,P(C)$ is satisfied, the events A, B, C
are not mutually independent because they are not pairwise independent. So in brief, the condition
$P(A \cap B \cap C) = P(A)\,P(B)\,P(C)$ does not guarantee (or imply) pairwise independence and hence it is
not sufficient (although it is necessary) for mutual independence. This similarly applies to more than
three events where higher-multiplicity independence does not guarantee pairwise or lower-multiplicity 
independence. 
This example should also show that the condition of mutual independence in Eqs. 56 and 57 is stronger 
than we need for the validity of these equations; in other words it is sufficient but not necessary. We 
also note that this example, in addition to the example of part (b) of this Problem, should show that 
pairwise independence (in itself) is neither sufficient nor necessary for the validity of Eqs. 56 and 
57 (even though mutual independence, which implies pairwise independence, was imposed in these 
equations to ensure their validity). 
PE: Repeat this Problem by giving other examples for the three parts as well as an example similar 
to the example given in the note. 
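
The checks made by hand in this Problem can also be done by a short program. The following C++ code is a minimal sketch of such a check (it is not one of the codes of this book; the names `Event`, `independentPair`, `pairwiseIndependent`, `tripleConditionHolds` and `mutuallyIndependent` are hypothetical and introduced here only for illustration). It assumes that all N cards are equally likely, so that the condition P(X)P(Y) = P(X∩Y) becomes the integer condition |X||Y| = N|X∩Y| and no rounding is involved.

```cpp
#include <algorithm>
#include <iterator>
#include <set>

using Event = std::set<int>;

// Cards common to two events.
Event intersect(const Event& x, const Event& y) {
    Event out;
    std::set_intersection(x.begin(), x.end(), y.begin(), y.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// With n equally likely cards, "P(X)P(Y) equals P(X and Y)" is the same as
// |X| * |Y| == n * |X and Y|, which can be tested exactly with integers.
bool independentPair(const Event& x, const Event& y, long n) {
    const long lhs = static_cast<long>(x.size()) * static_cast<long>(y.size());
    const long rhs = n * static_cast<long>(intersect(x, y).size());
    return lhs == rhs;
}

bool pairwiseIndependent(const Event& a, const Event& b, const Event& c, long n) {
    return independentPair(a, b, n) && independentPair(a, c, n) &&
           independentPair(b, c, n);
}

// "P(A)P(B)P(C) equals P(A and B and C)" is the same as
// |A| * |B| * |C| == n * n * |A and B and C|.
bool tripleConditionHolds(const Event& a, const Event& b, const Event& c, long n) {
    const long lhs = static_cast<long>(a.size()) * static_cast<long>(b.size()) *
                     static_cast<long>(c.size());
    const long rhs = n * n * static_cast<long>(intersect(intersect(a, b), c).size());
    return lhs == rhs;
}

// For three events, mutual independence needs both conditions.
bool mutuallyIndependent(const Event& a, const Event& b, const Event& c, long n) {
    return pairwiseIndependent(a, b, c, n) && tripleConditionHolds(a, b, c, n);
}
```

For instance, calling `pairwiseIndependent` and `tripleConditionHolds` with the events of the note (and n = 16) should report that the triple condition holds while pairwise independence fails, in agreement with the discussion above.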

3. Verify Eq. 54 for the events of: 


(a) Part (a) of Problem 2. (b) Part (b) of Problem 2. 
(c) Part (c) of Problem 2. (d) The note of Problem 2. 
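Answer:
(a)
$$\begin{aligned}
P(A \cap B \cap C) = 0 &= \tfrac{1}{2} \times \tfrac{1}{3} \times 0 = P(A)\,P(B|A)\,P[C|(A \cap B)] \\
&= \tfrac{1}{2} \times \tfrac{1}{3} \times 0 = P(B)\,P(C|B)\,P[A|(B \cap C)] \\
&= \tfrac{1}{2} \times \tfrac{1}{3} \times 0 = P(C)\,P(A|C)\,P[B|(C \cap A)]
\end{aligned}$$
(b)
$$\begin{aligned}
P(A \cap B \cap C) = 0 &= \tfrac{1}{2} \times \tfrac{1}{2} \times 0 = P(A)\,P(B|A)\,P[C|(A \cap B)] \\
&= \tfrac{1}{2} \times \tfrac{1}{2} \times 0 = P(B)\,P(C|B)\,P[A|(B \cap C)] \\
&= \tfrac{1}{2} \times \tfrac{1}{2} \times 0 = P(C)\,P(A|C)\,P[B|(C \cap A)]
\end{aligned}$$
(c)
$$\begin{aligned}
P(A \cap B \cap C) = \tfrac{1}{8} &= \tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{2} = P(A)\,P(B|A)\,P[C|(A \cap B)] \\
&= \tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{2} = P(B)\,P(C|B)\,P[A|(B \cap C)] \\
&= \tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{2} = P(C)\,P(A|C)\,P[B|(C \cap A)]
\end{aligned}$$
(d)
$$\begin{aligned}
P(A \cap B \cap C) = \tfrac{1}{8} &= \tfrac{1}{2} \times \tfrac{1}{4} \times 1 = P(A)\,P(B|A)\,P[C|(A \cap B)] \\
&= \tfrac{1}{2} \times \tfrac{5}{8} \times \tfrac{2}{5} = P(B)\,P(C|B)\,P[A|(B \cap C)] \\
&= \tfrac{1}{2} \times \tfrac{3}{8} \times \tfrac{2}{3} = P(C)\,P(A|C)\,P[B|(C \cap A)]
\end{aligned}$$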

PE: Repeat this Problem for the examples of the PE of Problem 2. 
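
The verification of Eq. 54 can be repeated numerically for any example of this kind. The following C++ code is a sketch of one possible way of doing this, where each event is stored as a bit mask (bit k is set when card k belongs to the event). The names `makeEvent`, `cardCount`, `tripleProbability` and `chainRule` are hypothetical, and the code assumes that the cards are equally likely, that card numbers are between 0 and 31, and that A∩B is not empty (otherwise the last conditional probability is undefined).

```cpp
#include <bitset>
#include <cstdint>
#include <initializer_list>

// Encode an event as a bit mask: bit k is set when card k belongs to the event.
// Card numbers are assumed to lie between 0 and 31.
std::uint32_t makeEvent(std::initializer_list<int> cards) {
    std::uint32_t mask = 0;
    for (int k : cards) mask |= (std::uint32_t{1} << k);
    return mask;
}

// Number of cards in an event.
int cardCount(std::uint32_t event) {
    return static_cast<int>(std::bitset<32>(event).count());
}

// Left hand side of Eq. 54: probability of the triple intersection (n cards in the pack).
double tripleProbability(std::uint32_t a, std::uint32_t b, std::uint32_t c, int n) {
    return static_cast<double>(cardCount(a & b & c)) / n;
}

// Right hand side of Eq. 54: P(A) P(B|A) P[C|(A and B)] obtained by counting.
// Assumes that A and its intersection with B are both non-empty.
double chainRule(std::uint32_t a, std::uint32_t b, std::uint32_t c, int n) {
    const double pA = static_cast<double>(cardCount(a)) / n;
    const double pBgivenA = static_cast<double>(cardCount(a & b)) / cardCount(a);
    const double pCgivenAB = static_cast<double>(cardCount(a & b & c)) / cardCount(a & b);
    return pA * pBgivenA * pCgivenAB;
}
```

Permuting the arguments of `chainRule` gives the other two forms of the right hand side, e.g. `chainRule(B, C, A, 16)` corresponds to P(B) P(C|B) P[A|(B∩C)].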

4. What do you notice from the results of Problems 2 and 3?

Answer: We notice that Eq. 54 is always true but Eq. 56 is not always true because it is conditioned 
by mutual independence (noting that “always” here is within the investigated cases although we know 
from other evidence that it is true in general). 

PE: Do you recommend using Eq. 54 in all cases (i.e. for three events, and its analogue for more than three
events; see Eq. 55), especially when there is some confusion about the nature of the events with regard
to their independence?

5. Distinguish between mutual exclusivity and pairwise exclusivity. 

Answer: Let us define mutual exclusivity in the style of mutual independence, that is: a collection of
events (i.e. $\geq 3$ events) are mutually exclusive if any number of events in the collection are exclusive
(i.e. their intersection is empty and hence the probability of the intersection is zero). Accordingly, if
a number of events are mutually exclusive then they should be pairwise exclusive. We also note that
if a number of events are pairwise exclusive then they should be mutually exclusive.

PE: Why is it that if a number of events are pairwise exclusive then they should be mutually exclusive?

6. Referring to Problem 5, can we say: mutual exclusivity and pairwise exclusivity are equivalent? 
Answer: Yes, we can say this but only at the practical level (i.e. if a collection of events are
pairwise/mutually exclusive then they should be mutually/pairwise exclusive). However, at the
conceptual level they are not equivalent because mutual exclusivity is a stronger (or richer) condition than
pairwise exclusivity since it embeds multiple-events exclusivity (in addition to pairwise exclusivity).
To clarify the idea further, let us define inclusivity as the opposite of exclusivity. Accordingly, both:

A = {1, 2, 3}    B = {3, 4, 5}    C = {1, 5, 6}

and

D = {1, 2, 3, 7}    E = {3, 4, 5, 7}    F = {1, 5, 6, 7}

are pairwise inclusive, but only D, E, F are mutually inclusive because $D \cap E \cap F = \{7\} \neq \emptyset$ while
$A \cap B \cap C = \emptyset$ (which is an extra condition imposed by mutual exclusivity but not by pairwise
exclusivity).
PE: Compare pairwise/mutual exclusivity with pairwise/mutual independence and try to justify any 
difference. 

7. Prove the addition law of probability for three events A, B,C (i.e. Eq. 51). 


Answer:
$$\begin{aligned}
P(A \cup B \cup C) &= P[A \cup (B \cup C)] \\
&= P(A) + P(B \cup C) - P[A \cap (B \cup C)] \\
&= P(A) + P(B) + P(C) - P(B \cap C) - P[A \cap (B \cup C)] \\
&= P(A) + P(B) + P(C) - P(B \cap C) - P[(A \cap B) \cup (A \cap C)] \\
&= P(A) + P(B) + P(C) - P(B \cap C) - P(A \cap B) - P(A \cap C) + P[(A \cap B) \cap (A \cap C)] \\
&= P(A) + P(B) + P(C) - P(B \cap C) - P(A \cap B) - P(A \cap C) + P(A \cap B \cap A \cap C) \\
&= P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)
\end{aligned}$$

where in equality 1 we use Eq. 18, in equality 2 we apply Eq. 43 on $P[A \cup (B \cup C)]$,[74] in equality
3 we apply Eq. 43 on $P(B \cup C)$, in equality 4 we apply Eq. 19, in equality 5 we apply Eq. 43 on
$P[(A \cap B) \cup (A \cap C)]$, in equality 6 we apply Eq. 17, and in equality 7 we apply Eq. 5 (in association
with Eq. 14).

PE: Derive the addition law of probability for four events A, B,C, D. 


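8. Prove the multiplication law of probability for three events A, B, C (i.e. Eq. 54).


Answer:
$$\begin{aligned}
P(A \cap B \cap C) &= P[(A \cap B) \cap C] && \text{(Eq. 17)} \\
&= P(A \cap B)\,P[C|(A \cap B)] && \text{(Eq. 47)} \\
&= P(A)\,P(B|A)\,P[C|(A \cap B)] && \text{(Eq. 47)}
\end{aligned}$$


PE: Can we write $P[(B \cap C)|A] = P(B|A)\,P[C|(A \cap B)]$?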


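9. Derive the multiplication law of probability for four events A, B, C, D.


Answer:
$$\begin{aligned}
P(A \cap B \cap C \cap D) &= P[(A \cap B \cap C) \cap D] && \text{(Eq. 17)} \\
&= P(A \cap B \cap C)\,P[D|(A \cap B \cap C)] && \text{(Eq. 47)} \\
&= P(A)\,P(B|A)\,P[C|(A \cap B)]\,P[D|(A \cap B \cap C)] && \text{(Eq. 54)}
\end{aligned}$$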


PE: Derive the multiplication law of probability for five events A, B,C, D, E. 

10. In a class made of 50 students, 20% have a height of less than 160cm, 64% have a height between 160cm
and 180cm, and 16% have a height greater than 180cm. If 5 students are selected at random from this
class, what is the probability of all 5 being less than 160cm tall?

Answer: We have $C_5^{10}$ combinations for selecting 5 students out of the 10 students whose height is
less than 160cm, and we have $C_5^{50}$ combinations for selecting 5 students out of the 50 students. Hence:

$$P(\text{height of all 5} < 160\text{cm}) = \frac{C_5^{10}}{C_5^{50}} = \frac{252}{2118760} \approx 0.0001189$$

We may also argue (differently) that if we select the 5 students successively then the probabilities of the
1st, 2nd, 3rd, 4th, 5th student to have a height less than 160cm are (10/50), (9/49), (8/48), (7/47), (6/46)
and hence by the multiplication law of probability (for dependent events)[75] we have:

$$P(\text{height of all 5} < 160\text{cm}) = \frac{10}{50} \times \frac{9}{49} \times \frac{8}{48} \times \frac{7}{47} \times \frac{6}{46} \approx 0.0001189$$


PE: Find the probability of: 

(a) Exactly 1 student of the selected 5 being less than 160cm tall, and exactly 1 student of the selected 
5 being greater than 180cm tall. 

(b) Exactly 3 students of the selected 5 being greater than 180cm tall. 
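
The two arguments of this Problem (i.e. counting combinations and successive selection) can be compared numerically. The following C++ code is a minimal sketch of this comparison; the function names `choose`, `probabilityByCounting` and `probabilityBySuccession` are hypothetical and are used here only for illustration.

```cpp
#include <cstdint>

// Binomial coefficient "n choose k", built up step by step so that every
// intermediate value is itself a binomial coefficient (and hence an integer).
std::uint64_t choose(unsigned n, unsigned k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= k; ++i)
        result = result * (n - k + i) / i;
    return result;
}

// Counting argument: (ways to pick all selected students from the group of
// interest) divided by (ways to pick the selected students from the class).
double probabilityByCounting(unsigned group, unsigned classSize, unsigned selected) {
    return static_cast<double>(choose(group, selected)) /
           static_cast<double>(choose(classSize, selected));
}

// Successive-selection argument: product of (group - i) / (classSize - i).
double probabilityBySuccession(unsigned group, unsigned classSize, unsigned selected) {
    if (selected > group) return 0.0;
    double p = 1.0;
    for (unsigned i = 0; i < selected; ++i)
        p *= static_cast<double>(group - i) / static_cast<double>(classSize - i);
    return p;
}
```

With `group = 10`, `classSize = 50` and `selected = 5` both functions should give a value of about 0.0001189.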

11. Let U be the set of lower case letters of the English alphabet, A = {a, b, d, k, r, t, z},
B = {a, c, d, m, r, t, w, x}, and C = {a, m, n, q, r, s, t}. If a lower case English letter is selected
randomly, verify Eq. 51 for $P(A \cup B \cup C)$ in this case.

Answer: The union $A \cup B \cup C$ = {a, b, c, d, k, m, n, q, r, s, t, w, x, z} contains 14 elements. Moreover,
we have 26 lower case English letters. Hence, by simple count we have $P(A \cup B \cup C) = 14/26 = 7/13$.
Regarding Eq. 51 we have [following a simple count approach as we did already with $P(A \cup B \cup C)$]:

$P(A) = 7/26$    $P(B) = 8/26$    $P(C) = 7/26$    $P(A \cap B) = 4/26$
$P(A \cap C) = 3/26$    $P(B \cap C) = 4/26$    $P(A \cap B \cap C) = 3/26$

Hence, from Eq. 51 we get:

$$P(A \cup B \cup C) = \frac{7}{26} + \frac{8}{26} + \frac{7}{26} - \frac{4}{26} - \frac{3}{26} - \frac{4}{26} + \frac{3}{26} = \frac{14}{26} = \frac{7}{13}$$

which is identical to what we got earlier by simple count. So, Eq. 51 for $P(A \cup B \cup C)$ is verified in
this case.

[74] We note that Eq. 43 was derived in part (e) of Problem 14 of § 3.2.
[75] In fact, we are using Eq. 55 which is an extended form of Eq. 54.

PE: Do the following: 

(a) Create and solve a Problem similar to this using other U (such as a set of numbers, or a set of 
plane geometric shapes, or a set of chemical elements in the periodic table). 

(b) If D = {a,b, f, j,k, p,t, w, y, z} (and A, B,C are as above), verify Eq. 52 for P(AUBUCUD). 
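
The count made in this Problem can be automated, which is convenient when trying other sets (as in the PE). The following C++ code is a sketch under the assumption that B = {a, c, d, m, r, t, w, x}, i.e. the 8-element reading which is consistent with P(B) = 8/26. The names `Letters`, `fromString`, `unite`, `intersect`, `unionCount` and `additionLawCount` are hypothetical. Since all probabilities share the denominator 26, the code compares only the numerators.

```cpp
#include <set>
#include <string>

using Letters = std::set<char>;

// Build a set of letters from a string such as "abdkrtz".
Letters fromString(const std::string& s) {
    return Letters(s.begin(), s.end());
}

Letters unite(const Letters& x, const Letters& y) {
    Letters out = x;
    out.insert(y.begin(), y.end());
    return out;
}

Letters intersect(const Letters& x, const Letters& y) {
    Letters out;
    for (char ch : x)
        if (y.count(ch) != 0) out.insert(ch);
    return out;
}

// Number of letters in the union, obtained by simple count.
int unionCount(const Letters& a, const Letters& b, const Letters& c) {
    return static_cast<int>(unite(unite(a, b), c).size());
}

// Numerator of the right hand side of Eq. 51 (addition law for three events).
int additionLawCount(const Letters& a, const Letters& b, const Letters& c) {
    const int singles = static_cast<int>(a.size() + b.size() + c.size());
    const int pairs = static_cast<int>(intersect(a, b).size() + intersect(a, c).size() +
                                       intersect(b, c).size());
    const int triple = static_cast<int>(intersect(intersect(a, b), c).size());
    return singles - pairs + triple;
}
```

For the sets of this Problem both `unionCount` and `additionLawCount` should return 14, which corresponds to the probability 14/26 = 7/13.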
12. What is the probability that a group of n persons have different birthdays (i.e. no two of them have
the same birthday)? Also find the minimum n for which this probability is less than 2/3.

Answer: We assume the following:

• The year has 365 days.

• The birthdays of the group of people in question are totally independent of each other (e.g. we do
not have twins).

• Any person in the group has equal probability to be born in any day of the year and hence the
probability of him being born in a specific day of the year is 1/365 (e.g. we do not have people selected
on the basis of their birth date which favors or disfavors certain birthdays such as those in the summer
or spring).

Now, if we select one person from the group randomly then there is no restriction on his birthday. If
we repeat the selection then the second person has a 1/365 probability of being born in the day of the
birthday of the first person and hence the probability that the second person has a birthday different
from the birthday of the first person is 1 − (1/365). Similarly, the third person has a probability of
1 − (2/365) that he has a birthday different from the first two persons. So, in general the $n^{th}$ person
has a probability of $1 - \frac{n-1}{365}$ that he has a birthday different from the n − 1 previously-selected persons.
Accordingly, by the law of multiplication (for dependent events),[76] the probability $P_{dif}$ that a group
of n persons have different birthdays is:

$$P_{dif} = \left(1 - \frac{1}{365}\right)\left(1 - \frac{2}{365}\right) \cdots \left(1 - \frac{n-1}{365}\right) = \prod_{k=1}^{n}\left(1 - \frac{k-1}{365}\right) \qquad (60)$$

We note that for n > 365 this probability becomes 0 according to this formula (as it should be).
Regarding the minimum n for which this probability is less than 2/3, it can be found simply by trial
using for instance a spreadsheet. Following this method we found that the minimum n is 18.

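Note: starting from Eq. 60, we can obtain another (and rather simpler) expression for $P_{dif}$ (for n ≤ 365), that is:

$$P_{dif} = \prod_{k=1}^{n}\left(1 - \frac{k-1}{365}\right) = \prod_{k=1}^{n}\left(\frac{365}{365} - \frac{k-1}{365}\right) = \prod_{k=1}^{n}\left(\frac{365-k+1}{365}\right) = \frac{1}{365^{n}}\prod_{k=1}^{n}\left(365-k+1\right) = \frac{365!}{365^{n}\,(365-n)!} \qquad (61)$$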

PE: Do the following:
• Create a spreadsheet to find the minimum n for which this probability is less than 1/2.
• Repeat this Problem for different birthmonths (instead of birthdays).
• Find the probability that in a group of n persons at least two persons have the same birthday.
• Make an argument to derive Eq. 61 directly rather than obtaining it from Eq. 60.

[76] As before, we are using Eq. 55 which is an extended form of Eq. 54.
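
The trial search mentioned in the answer of Problem 12 (which was done there by a spreadsheet) can also be done by a few lines of code. The following C++ code is a sketch of this search based on the product of Eq. 60. The function names `allDifferent` and `minimumGroupSize` and the parameter `days` (which allows the use of 12 "days" for birthmonths) are hypothetical, and `threshold` is assumed to be greater than 0.

```cpp
// Probability that n persons all have different birthdays (product of Eq. 60)
// for a year of `days` equally likely days.
double allDifferent(int n, int days = 365) {
    double p = 1.0;
    for (int k = 1; k <= n; ++k)
        p *= 1.0 - static_cast<double>(k - 1) / days;
    return p;
}

// Smallest n for which allDifferent(n) is less than `threshold`.
// The search stops at days + 1 at the latest, where the probability is 0.
int minimumGroupSize(double threshold, int days = 365) {
    int n = 1;
    while (n <= days && allDifferent(n, days) >= threshold) ++n;
    return n;
}
```

For example, `minimumGroupSize(2.0 / 3.0)` should return 18 in agreement with the answer above. The probability that at least two persons have the same birthday is simply `1.0 - allDifferent(n)`.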
13. A box contains 10 balls: 4 black and 6 white. If 3 balls are drawn randomly, what is the probability 
of being all black? 
Answer: We have $C_3^{4}$ combinations for drawing 3 black balls (out of 4 black balls), and $C_3^{10}$
combinations for drawing 3 balls (out of 10 balls). Hence:

$$P(\text{all 3 balls are black}) = \frac{C_3^{4}}{C_3^{10}} = \frac{4}{120} = \frac{1}{30}$$

Alternatively, if we select the 3 balls successively then the probabilities of the 1st, 2nd, 3rd ball to be
black are (4/10), (3/9), (2/8) and hence by the multiplication law of probability (for dependent events
using Eq. 54) we have:

$$P(\text{all 3 balls are black}) = \frac{4}{10} \times \frac{3}{9} \times \frac{2}{8} = \frac{1}{30}$$
PE: Repeat the Problem for drawing 4 balls which are: 


(a) All white. (b) All black. (c) 2 white and 2 black. (d) 1 white and 3 black. 


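14. Let event A be a union of n pairwise exclusive events ($A_1, A_2, \ldots, A_n$) and let B and C be other
events in the same sample space. Prove the following:

(a) $P(A|B) = \sum_{i=1}^{n} P(A_i|B)$

(b) $P(B) = \sum_{i=1}^{n} P(A_i)\,P(B|A_i)$ (A represents the entire sample space)

(c) $P(B|C) = \sum_{i=1}^{n} P(A_i|C)\,P[B|(A_i \cap C)]$ (A represents the entire sample space)

Answer:
(a)
$$\begin{aligned}
P(A \cap B) &= \sum_{i=1}^{n} P(A_i \cap B) && \text{(Eq. 58)} \\
\frac{P(A \cap B)}{P(B)} &= \sum_{i=1}^{n} \frac{P(A_i \cap B)}{P(B)} && [\text{dividing by } P(B) \neq 0] \\
P(A|B) &= \sum_{i=1}^{n} P(A_i|B) && \text{(Eq. 45)}
\end{aligned}$$

We note that this equation is called the addition law for conditional probabilities.

(b)
$$\begin{aligned}
P(B) &= \sum_{i=1}^{n} P(A_i \cap B) && \text{(Eq. 59)} \\
P(B) &= \sum_{i=1}^{n} P(A_i)\,P(B|A_i) && \text{(Eq. 47)} \qquad (62)
\end{aligned}$$

(c)
$$\begin{aligned}
P(B \cap C) &= \sum_{i=1}^{n} P[A_i \cap (B \cap C)] && \text{(Eq. 59)} \\
P(B \cap C) &= \sum_{i=1}^{n} P(C \cap A_i \cap B) && \text{(Eq. 17)} \\
P(B \cap C) &= \sum_{i=1}^{n} P(C)\,P(A_i|C)\,P[B|(A_i \cap C)] && \text{(Eq. 54)} \\
\frac{P(B \cap C)}{P(C)} &= \sum_{i=1}^{n} P(A_i|C)\,P[B|(A_i \cap C)] && [\text{dividing by } P(C) \neq 0] \\
P(B|C) &= \sum_{i=1}^{n} P(A_i|C)\,P[B|(A_i \cap C)] && \text{(Eq. 45)}
\end{aligned}$$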


PE: What is the significance of the results of this Problem? Give some real-life examples of these 
results. 


3.4 Abstract Devices in Probability 


There are several types of abstract devices and techniques which are commonly used in tackling and solving
probability problems. These devices are used for various purposes like explanation, illustration, proving,
testing and verification. These include Venn diagrams and tree diagrams which are widely used for
explanation and illustration and can be crucial for setting, clarifying and formulating the problems.[77]
Tables and graphs (e.g. histograms) may also be used for these purposes. Computer simulation
can also be used for verifying the solutions obtained analytically or by other methods (see § 1.6).[78] In
the following Problems we demonstrate the use of these devices in probability.


Problems 


1. Draw a Venn diagram representing the case of Problem 22 of § 3.2. 
Answer: See Figure 13. 
PE: Referring to part (b) of Problem 9 of § 3.2, draw a Venn diagram for the sample space with A 
being the set of “sum of the dice is greater than 8” and B being the set of “at least one of the dice is 
5”. 

2. Make a tree diagram for part (c) of Problem 6 of § 3.2. 
Answer: See Figure 14. 
PE: Make a tree diagram for the case of Problem 22 of § 3.2 (noting that you have more than one 
possibility). 

3. Create a simple table that represents the case of Problem 22 of § 3.2 and can be used to answer the 
questions related to that Problem. 
Answer: See Table 2. 


Table 2: The table of Problem 3 of § 3.4.

         | Male | Female | Sum
Doctor   |  2   |   1    |  3
Nurse    |  2   |   4    |  6
Sum      |  4   |   5    |  9


PE: Make a simple table displaying the possibilities of the outcome of a trial in which a coin and a die 
are thrown simultaneously (noting that the sample space of throwing a coin is {H,T} and the sample 
space of throwing a die is {1, 2,3, 4,5, 6}). 


[77] We note that Venn diagrams and tree diagrams were introduced in § 2.4.
[78] We note that simulation is used in several parts of this book including the present section. We also note that simulation 
may be used for purposes other than verification (e.g. for solving problems initially). 
Figure 13: The Venn diagram of Problem 1 of § 3.4 (noting that D, N, M, F stand for doctor, nurse, male,
female and each filled circle represents a member of medical staff).


Table 3: The table of Problem 4 of § 3.4.
11    12    13    14    15    16
21    22    23    24    25   (26)
31    32    33    34   (35)   36
41    42    43   (44)   45    46
51    52   (53)   54    55    56
61   (62)   63    64    65    66


4. Make and use a table to find the probability of getting a sum of 8 in a trial of rolling two (fair and 
independent) dice. 
Answer: Referring to Table 3 (noting that “25” for instance means getting 2 on the first die and 5 on 
the second die), we note that out of 36 entries in the table we have 5 entries whose sum is 8 (i.e. the 
bold and parenthesized ones). Now, if we note that the dice are fair and independent (and hence all 
the possibilities represented by these entries are equally likely), then we can conclude (using Eq. 38) 
that the probability of getting a sum of 8 in this trial is 5/36. 
PE: Use Table 3 to find the probability of: 


(a) Getting a double, e.g. 22. (b) Getting odd number on both dice. 
(c) Getting a sum of less than 3 or greater than 8. (d) Getting a prime sum. 
(e) Getting at least one prime number on the dice. (f) Getting a sum which is even. 


(g) Getting an odd number on one die and an even number on the other. 


5. Use Table 3 to create a new table showing a sample space whose points represent the possible sums of 
the numbers on the two dice with their probabilities. 
Answer: The sums (which range between 2 and 12) and their probabilities can be obtained from the
diagonals (from upper right to lower left) of Table 3. For instance, the sum of 4 is represented by the
diagonal made of the elements 13, 22, and 31 and hence this sum has 3 (possibilities)
out of 36 (possibilities) and thus its probability is 3/36. The results are given in Table 4.


[Tree diagram with stages labeled: after 1st trial, after 2nd trial, after 3rd trial.]
Figure 14: The tree diagram of Problem 2 of § 3.4.


Table 4: The table of Problem 5 of § 3.4.
Sum          |  2   |  3   |  4   |  5   |  6   |  7   |  8   |  9   |  10  |  11  |  12
Probability  | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36
Note: in Problem 4 we have an example of uniform sample space and in the present Problem we have 
an example of non-uniform sample space (see § 3.2). 
PE: Use Table 3 to create a new table (similar to Table 4) showing a sample space whose points 
represent the following: 
(a) Possible sums s: $s < 4$, $4 \leq s \leq 6$, $7 \leq s \leq 10$, and $s > 10$.[79]
(b) Possible products a: a<6, 6<a<18, 14<a<24, and a> 24. 
6. Repeat Problem 5 but this time create a graph (histogram). 
Answer: See Figure 15. 
PE: Create graphs (histograms) for parts (a) and (b) of the PE of Problem 5. 
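
The distribution of the sums (as given in Table 4) and a rough text version of its histogram can be generated by enumerating the 36 equally likely outcomes of the two dice. The following C++ code is a minimal sketch of this enumeration; the names `sumCounts` and `histogram` are hypothetical.

```cpp
#include <array>
#include <string>

// counts[s] is the number of outcomes (out of 36) whose sum is s, for s = 2, ..., 12.
// Entries 0 and 1 are unused and stay 0.
std::array<int, 13> sumCounts() {
    std::array<int, 13> counts{};
    for (int first = 1; first <= 6; ++first)
        for (int second = 1; second <= 6; ++second)
            ++counts[first + second];
    return counts;
}

// Text histogram: one row per sum, one star per outcome (i.e. per 1/36 of probability).
std::string histogram() {
    const std::array<int, 13> counts = sumCounts();
    std::string out;
    for (int s = 2; s <= 12; ++s)
        out += (s < 10 ? " " : "") + std::to_string(s) + " | " +
               std::string(counts[s], '*') + "\n";
    return out;
}
```

The probability of any sum s is then `sumCounts()[s] / 36.0`, and printing the string returned by `histogram()` gives a sideways version of the histogram.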

7. Find the probability of getting a sum of 7 or an odd number in a trial of rolling two (fair and indepen- 
dent) dice (where “odd number” means a 2-digit number made of combining the numbers of the two 
dice, e.g. if the first die is 1 and the second is 3 then we have 13). 

Answer: This Problem can be easily solved by a glance at Table 3 where we see 18 odd numbers
(i.e. 11, 13, ..., 65) and 3 even numbers whose digits add up to 7 (i.e. 16, 34 and 52) and hence we
have 21 possibilities (out of 36 possibilities) that meet the requirements.[80] Therefore, the probability
of getting a sum of 7 or an odd number is 21/36 = 7/12.

As a check, let us use the addition law of probability (where $S_7$ stands for sum of 7 and O stands for
odd), that is (see Eq. 43):

$$P(S_7 \cup O) = P(S_7) + P(O) - P(S_7 \cap O) = \frac{6}{36} + \frac{18}{36} - \frac{3}{36} = \frac{21}{36} = \frac{7}{12}$$


[79] Investigate the possibility of using Table 4.
[80] We note that “odd numbers” and “even numbers whose digits add up to 7” are disjoint. In fact, we are effectively using
Eq. 44.


Figure 15: The histogram of Problem 6 of § 3.4. The horizontal axis represents the sums while the vertical
axis represents their probabilities.


PE: Repeat the Problem for the following: 
(a) Getting a sum of 10 or a perfect square. (b) Getting a product of 12 or a prime number. 
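
The count of Problem 7 can be checked by enumerating the 36 entries of Table 3. The following C++ code is a sketch of this check; the structure `Tally` and the function `countSevenOrOdd` are hypothetical names. It counts the entries with a sum of 7, the entries which are odd as 2-digit numbers, the entries satisfying both, and the entries satisfying at least one of them.

```cpp
// Counts (out of 36) of the entries of the two-dice table.
struct Tally {
    int sumSeven;  // entries whose digits add up to 7
    int odd;       // entries which are odd as 2-digit numbers
    int both;      // entries satisfying both conditions
    int either;    // entries satisfying at least one condition
};

Tally countSevenOrOdd() {
    Tally t{0, 0, 0, 0};
    for (int first = 1; first <= 6; ++first) {
        for (int second = 1; second <= 6; ++second) {
            const bool isSeven = (first + second == 7);
            const bool isOdd = ((10 * first + second) % 2 == 1);
            if (isSeven) ++t.sumSeven;
            if (isOdd) ++t.odd;
            if (isSeven && isOdd) ++t.both;
            if (isSeven || isOdd) ++t.either;
        }
    }
    return t;
}
```

The addition law then corresponds to the identity `either == sumSeven + odd - both`, and the required probability is `either / 36.0`.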


8. Mention some advantages for the use of tables and graphs in presenting data and information in 
probability theory. 

Answer: Tables and graphs present data and information in a visual and compact form and hence
they provide effective and efficient methods for understanding and communicating the problems. For
example, if we want to find the probability of having a sum between 4 and 6 in the trial of rolling two
(fair and independent) dice then by a glance at Table 4 or Figure 15 we can easily find this probability
by summing the corresponding probabilities, i.e. (3/36) + (4/36) + (5/36) = 12/36 = 1/3.

PE: Mention some disadvantages for using tables and graphs. Also, mention some advantages and 
disadvantages for using Venn and tree diagrams and computer simulation. 

9. In a national park there are three types of big cats whose numbers and maturity state are given in 
Table 5. If one of these big cats is killed randomly (e.g. by poachers), use the table to find the proba- 
bility of being: 

(a) Juvenile leopard. (b) Mature lion. (c) Mature. (d) Juvenile cheetah OR mature leopard. 
Answer: The total number of cats is 156 + 68 + 94 + 179 + 59 + 101 = 657. We label lion, leopard,
cheetah, juvenile, mature with L, D, C, J, M.

(a) $P(J \cap D) = 68/657 \approx 0.1035$.

(b) $P(M \cap L) = 179/657 \approx 0.2725$.

(c) $P(M) = (179 + 59 + 101)/657 \approx 0.5160$.

(d) $P[(J \cap C) \cup (M \cap D)] = (94 + 59)/657 \approx 0.2329$ (noting that $J \cap C$ and $M \cap D$ are disjoint).
PE: Find the probability of being:

(a) Not cheetah. (b) Lion OR leopard. (c) Leopard OR juvenile lion.
(d) Non-cheetah juvenile. (e) Mature OR leopard. (f) Lion OR juvenile.




Table 5: The table of Problem 9 of § 3.4.

              | Lion (L) | Leopard (D) | Cheetah (C)
Juvenile (J)  |   156    |     68      |     94
Mature (M)    |   179    |     59      |    101


10. Test the result of Problem 13 of § 3.3 by simulation. 
Answer: See Box4B6W.cpp code. 
PE: Modify the Box4B6W.cpp code to simulate the results of the PE of Problem 13 of § 3.3. 
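
The Box4B6W.cpp code itself is not listed in this section. For readers who want to see what a simulation of this type may look like, the following C++ code is a minimal sketch (it is not the Box4B6W.cpp code, and the function name `simulateAllBlack` and its parameters are hypothetical). It represents the box as an array of 10 balls, shuffles it in each trial, and takes the first 3 balls as the drawn ones.

```cpp
#include <algorithm>
#include <array>
#include <random>

// Estimate the probability that the 3 balls drawn (without replacement) from a box
// of 4 black and 6 white balls are all black, by repeating the draw `trials` times.
double simulateAllBlack(int trials, unsigned seed) {
    std::mt19937 rng(seed);
    std::array<int, 10> box = {1, 1, 1, 1, 0, 0, 0, 0, 0, 0};  // 1 = black, 0 = white
    int hits = 0;
    for (int t = 0; t < trials; ++t) {
        std::shuffle(box.begin(), box.end(), rng);  // random order of the 10 balls
        if (box[0] + box[1] + box[2] == 3) ++hits;  // first 3 balls are the drawn ones
    }
    return static_cast<double>(hits) / trials;
}
```

With a large number of trials (e.g. 200000) the returned estimate should be close to the analytical value 1/30 ≈ 0.0333, although the exact figure depends on the seed and on the number of trials.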


Chapter 4 
Probability Functions 


In this chapter we investigate probability functions of random variables. However, before we go through 
this investigation we need to introduce some definitions and preliminaries. 

A random variable is a variable that can take various values each of which is associated with a given
probability. For example, the integers 1, 2, 3, 4, 5, 6 obtained in a die-throwing trial are the values of a
random variable where each of these values is associated with a probability of 1/6 (assuming the die is fair).
Similarly, the weight of newborn babies is a random variable where each of its values is associated with a certain
probability (depending on many factors like ethnicity, social and economic status, etc.). Random variables
are of two main types: discrete and continuous. A discrete random variable is characterized by taking
a countable number of values (e.g. the integers 1, 2, 3, 4, 5, 6 obtained in a die-throwing trial), while
a continuous random variable is characterized by taking a continuous range of values (e.g. the height
of students in a physics class).[81] It is worth noting that continuous random variables may be treated
as discrete (and vice versa) for various purposes and objectives (e.g. for convenience or for comparison),
and this can take several shapes and forms. Instances of such treatment can be found in this book (as
well as in the literature of probability in general).

A probability distribution function (or probability distribution for brevity) is a mathematical relation
that gives probability as a function of a random variable. It is obvious that probability distributions
are real-valued functions (noting that probability is real). As we have discrete and continuous random
variables, we have the probability mass function which is a probability distribution of a discrete random
variable and the probability density function which is a probability distribution of a continuous random
variable. Probability distributions of discrete random variables are usually presented graphically by
a histogram or a bar graph (see for instance Figure 15), while probability distributions of continuous
random variables are usually presented graphically by an ordinary graph (i.e. continuous curve).[82]

Probability distribution can be a function of a single random variable and can be a function of 
multiple random variables. However, in this chapter (and in this book in general) we generally deal 
with probability distributions of a single (real) random variable (noting that the generalization to the 
probability distributions of multiple random variables is generally straightforward). We should also note 
that any probability distribution must be normalized to unity since it represents probability.[83]

A cumulative distribution function is a real-valued function of a real-valued random variable that
gives the probability of the random variable taking a value less than or equal to a given value. As
cumulative distribution functions represent cumulative probability they are monotonically non-decreasing
functions (noting that probability is non-negative). As we have discrete and continuous random variables,
we have cumulative distributions of discrete random variables and cumulative distributions of continuous
random variables.

A Bernoulli trial (which may also be called binomial trial) is a random experiment that has only
two possible outcomes (which may be labeled as success and failure). Bernoulli trials (i.e. plural, which
may also be called a Bernoulli process) are a number of mutually-independent Bernoulli trials in which the
probability of each outcome is the same in all trials.


Problems 


[81] Although “height” in this example (as well as many other types of continuous variable) is primarily and intrinsically 
continuous, in practical reality it is discrete due to the limits on accuracy and units of measurement. We also note that 
random variable may also be mixed, i.e. it is partially discrete and partially continuous (noting that we will not study 
this type of random variable in this book). 

[82] As indicated in the previous footnote, random variables can be mixed and so can their distributions.

[83] It should be obvious that normalization here means scaling the sum of all probabilities in the distribution to 1.
1. Propose some possible random variables for the following processes:
(a) Throwing 2 dice once. (b) Throwing a coin repeatedly. 


(c) Observation of an animal species. (d) Observation of a type of stars. 


Answer: Examples of possible random variables are: 

(a) The sum (or product) of the two numbers obtained. The difference between the first number and 
the second number. The quotient of the first number to the second number. The absolute value of the 
difference between the two numbers. 

(b) The number of throws until an H is obtained. The number of throws until an HTH sequence is 
obtained. 

(c) Weight. Height. Age. Sex.[84]

(d) Brightness. Mass. Size. Surface temperature. 

PE: Repeat the Problem for the following processes: 


(a) Throwing a die repeatedly. (b) Throwing 5 coins together one time. 


(c) Observation of late-night travelers. (d) Observation of a type of chemical reaction. 


2. Discuss briefly the issue of transformation of random variables. 
Answer:[85] This is a big and complicated issue and hence it is out of the scope of this book. However,
in the following points we make a few remarks about this issue which we will need in the future (as
well as being useful and potentially necessary knowledge in general):
• A random variable can be transformed deterministically or stochastically to another variable which
is generally a random variable too. However, our interest (as well as the interest of other authors
in general) is about deterministic transformations using analytical functions such as linear functions.
For example, if x is a random variable with a probability distribution function f(x) then the random
variable y obtained from x by the linear transformation y(x) = ax + b could be of interest to us and
we may need to know its probability distribution function g(y) and how it is related to f(x).
• If x is a random variable and y is a random variable obtained from x by a given (deterministic)
transformation then it is reasonable (with certain conditions) to assign the probabilities of the values
of x to the corresponding values of y. So loosely speaking, if $x_1$ is a given element of x with a
probability $P(x_1)$ and $y_1$ is the image of $x_1$ (as obtained by the given transformation) then we can write
generically $P(y_1) = P(x_1)$. Such an assignment of the probabilities of y (i.e. from the probabilities of
x) should (under certain conditions) preserve the characteristics and requirements of probability (such
as normalization).
• Let x be a continuous random variable with a probability distribution (or density) function f(x), and
let y be a random variable obtained by a differentiable and strictly-increasing transformation function
T and hence we can write y = y(x), i.e. y is a function of x. Noting that T is one-to-one it should
have an inverse and hence we can also write x = x(y), i.e. x is also a function of y. Now, if g(y) is
the (presumed) probability distribution (or density) function of y then based on the previous point we
can sensibly write:

$$g(y)\,dy = f(x)\,dx$$

and hence

$$g(y) = f[x(y)]\,\frac{dx}{dy} \qquad (63)$$

where we note that x(y) is differentiable and it is explicitly expressed in terms of y (and hence the
right hand side of Eq. 63 is purely in terms of y). This means that we can (under these conditions)
obtain g(y) from f(x) with the help of the conditions imposed on x and y and the relation between
them.


As an example, let T be the (strictly-increasing) linear transformation y = ax + b where a > 0 and x
(which is a variable over a real interval satisfying the normalization condition) has the density function
f(x). So, from Eq. 63 we can write [noting that x = (y − b)/a and dx/dy = 1/a]:

$$g(y) = f[x(y)]\,\frac{dx}{dy} = f\!\left(\frac{y-b}{a}\right) \times \frac{1}{a} = \frac{1}{a}\,f\!\left(\frac{y-b}{a}\right)$$

As a second example, let T be the (strictly-increasing) quadratic transformation $y = ax^2$ where a > 0,
$0 < x < (3/a)^{1/3}$ and x has the density function f(x). So, from Eq. 63 we can write [noting that
$x = \sqrt{y}/\sqrt{a}$ and $dx/dy = 1/\sqrt{4ay}$]:

$$g(y) = f[x(y)]\,\frac{dx}{dy} = f\!\left(\sqrt{\frac{y}{a}}\right) \times \frac{1}{\sqrt{4ay}} = \frac{1}{\sqrt{4ay}}\,f\!\left(\sqrt{\frac{y}{a}}\right)$$

[84] Random variables which are not numeric (like sex) should be expressed numerically (e.g. 1 for male and 2 for female).
[85] The readers should be aware that for the sake of clarity we follow in this answer a simplified approach and hence it is
not sufficiently rigorous.


PE: Give two more examples (like the two given examples) about the transformation of random 
variables. 
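
A simple way of testing a transformed density obtained from Eq. 63 is to check numerically that it is normalized. The following C++ code is a sketch of such a test for the quadratic transformation y = ax² with x > 0, for which g(y) = f(√(y/a))/√(4ay). The density f(x) = 2x on 0 < x < 1 and the value a = 3 used in the usage note are assumptions made here for illustration only (they are not taken from the text), and the names `Density`, `transformedDensity` and `integrate` are hypothetical.

```cpp
#include <cmath>
#include <functional>

using Density = std::function<double(double)>;

// Eq. 63 for the quadratic transformation y = a x^2 (a > 0, x > 0):
// g(y) = f[x(y)] dx/dy with x(y) = sqrt(y/a) and dx/dy = 1/sqrt(4 a y).
double transformedDensity(const Density& f, double a, double y) {
    const double x = std::sqrt(y / a);
    const double dxdy = 1.0 / std::sqrt(4.0 * a * y);
    return f(x) * dxdy;
}

// Midpoint-rule approximation of the integral of g over the interval (lo, hi).
double integrate(const Density& g, double lo, double hi, int steps) {
    const double h = (hi - lo) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i)
        sum += g(lo + (i + 0.5) * h);
    return sum * h;
}
```

For instance, with f(x) = 2x on 0 < x < 1 and a = 3 the variable y ranges over 0 < y < 3 and Eq. 63 gives the constant density g(y) = 1/3, so integrating `transformedDensity` over (0, 3) with `integrate` should give a value very close to 1.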

3. Provide more explanation about the term “cumulative distribution function” as used in the literature 
(or as it can be used in general regardless of the literature). 
Answer: The primary meaning of the term “cumulative distribution function” is as defined above (i.e. 
a real-valued function of a real-valued random variable that gives the probability of the random variable 
taking a value less than or equal to a given value). However, this term may be used rather differently 
to label similar types of “cumulative distribution functions”. For instance, the term may be used to 
label a “function” of a random variable that gives the probability of the random variable taking a value 
between two given values, e.g. P(1 < k < 7).[86] This type of cumulative distribution function may
be distinguished from the “ordinary” cumulative distribution function by the label “inner cumulative
distribution function”. In fact, we can identify several types of “cumulative distribution function” which
include for instance: P(x < a), P(x > a) and P(a < x < b) where x is a random variable and a and
b are given numbers. These (and other types) have a common feature of being “cumulative” since 
they represent the added probability of more than one value of the random variable (where this added 
probability is obtained by summation or integration) and hence they are “cumulative”. 
PE: List all the possible types of “cumulative distribution function” in the broad sense of this term. 
As indicated earlier, probability mass function is defined for discrete variables. If x is a random variable
that can assume discrete distinct values $x_i$ (i = 1, 2, ...) with probabilities $P(x_i) = p_i$ then we have the
following properties for the mass function P(x):[87]

$$P(x_i) \geq 0 \qquad (64)$$
$$P(x_j \cap x_k) = 0 \quad (j \neq k) \qquad (65)$$
$$P(x_j \cup x_k) = p_j + p_k \quad (j \neq k) \qquad (66)$$
$$P\left(\cup_i\, x_i\right) = \sum_i P(x_i) = \sum_i p_i = 1 \qquad (67)$$

In the subsections of this section we investigate some of the well known and commonly used probability 
mass functions. 


Problems 


1. Justify the properties of the mass function (as expressed by Eqs. 64-67). 
Answer: 


[86] We note that if the two given values are specific then this is not really a “function”, so this label is used to represent this 
type in general considering the two limits (or values) as variables. 

[87] Probably it is more appropriate to describe Eqs. 64 and 67 as conditions (although they are properties as well) and 
describe Eqs. 65 and 66 as properties. 
• Eq. 64 is justified by the axioms of probability (see axiom 1 of § 3.1) noting that $P(x_i)$ is a probability. 
• Eq. 65 is justified by the fact that $x_j$ and $x_k$ are mutually exclusive events and hence they are subject 
to Eq. 42. 
• Eq. 66 is justified by the addition law of mutually exclusive events (i.e. Eq. 44 or rather axiom 3 of 
§ 3.1).[88] 
• Eq. 67 is justified by the axioms of probability (see axiom 2 of § 3.1). 

Note: from Eqs. 64 and 67 we can conclude another property for the mass function, that is $P(x_i) \le 1$, 
although this should be obvious from the fact that $P(x_i)$ is a probability. In fact, this property (like 
the property of Eq. 64) can be obtained directly from the axioms of probability (see axiom 1 of § 3.1). 
PE: 

(a) Can we say that the above properties of mass function are essentially an expression or demonstration 
of the fact that mass functions satisfy the conditions set by the axioms of probability? 

(b) Give some real examples (e.g. from daily life or from science) of probability mass function, and 
show that these examples satisfy the properties of probability mass function (i.e. Eqs. 64-67). 

2. Suppose we have a discrete random variable which has $k$ distinct and independent outcomes $x_i$, each with 
a probability $p_i$ ($i = 1,2,\ldots,k$ and $\sum_{i=1}^{k} p_i = 1$), and suppose we have $n$ independent trials involving this 
random variable. What is the probability $P(n_1,n_2,\ldots,n_k;p_1,p_2,\ldots,p_k)$ that in $n_1$ of these $n$ trials 
we get the $x_1$ outcome, in $n_2$ of these $n$ trials we get the $x_2$ outcome, ..., and in $n_k$ of these $n$ trials 
we get the $x_k$ outcome (noting that $\sum_{i=1}^{k} n_i = n$)? 
Answer: Referring to Problem 28 of § 2.2, we have $C^{n}_{n_1,n_2,\ldots,n_k}$ ways for getting “$n_1$ times of $x_1$, $n_2$ 
times of $x_2$, ..., and $n_k$ times of $x_k$”. Also, by the multiplication rule for independent events, each 
one of these $C^{n}_{n_1,n_2,\ldots,n_k}$ ways has a probability of $p_1^{n_1} \times p_2^{n_2} \times \cdots \times p_k^{n_k}$. So, by the addition law for 
mutually exclusive events we have: 

$$P(n_1,n_2,\ldots,n_k;p_1,p_2,\ldots,p_k) = C^{n}_{n_1,n_2,\ldots,n_k} \times p_1^{n_1} \times p_2^{n_2} \times \cdots \times p_k^{n_k} = \frac{n!}{\prod_{i=1}^{k} n_i!} \prod_{i=1}^{k} p_i^{n_i} \tag{68}$$

Note: it is worth noting that (where the sum is taken over all $n_1 + n_2 + \cdots + n_k = n$): 

$$\begin{aligned} \sum P(n_1,n_2,\ldots,n_k;p_1,p_2,\ldots,p_k) &= \sum C^{n}_{n_1,n_2,\ldots,n_k} \times p_1^{n_1} \times p_2^{n_2} \times \cdots \times p_k^{n_k} && \text{(Eq. 68)} \\ &= (p_1 + p_2 + \cdots + p_k)^{n} && \text{(Eq. 34)} \\ &= 1 \end{aligned}$$

So, it meets the normalization condition. 
PE: What is the relation between Eq. 33 and Eq. 34? Try to correspond the symbols in Eq. 33 to 
the symbols in Eq. 34. 

3. Give some examples of (discrete) functions that can be probability mass functions (i.e. they meet the 
conditions of Eqs. 64-67). 
Answer: The following functions can be probability mass functions because they meet the conditions 
of Eqs. 64-67: 
(a) P(k) = SEA (k= 2,3,...,12), P(k)=0 (otherwise). 
(b) P(k)= 4 (k=-1.5,-1,-0.5,...,9), P(k)=0 (otherwise). 


22 
(c) P(k) = ag@ay (k= 2,3,5,6,9), P(k)=0 (otherwise). 
(d) P(k)=& (k=1,2,...,00), P(k)=0 (otherwise). 


PE: Give more examples like the given ones. 


[88] We note that Eq. 66 extends to more than two values of the random variable, e.g. $x_j$, $x_k$, $x_l$. 
4. Which of the following can be a probability mass function (noting that we use P tentatively at this 
stage): 
t (k=10,11,...,30), P(k)=0 (otherwise). 
= (k=-5,-4,...,10), P(k)=0 (otherwise). 
ea? (= 0, leter noo), - 0 (otherwise). 
(d) P(k)= 38, (k=1,2,...,20), P(k)=0 (otherwise). 
Answer: 
(a) It cannot, because it does not satisfy the condition of Eq. 67. 
(b) It cannot, because it does not satisfy the condition of Eq. 64. 
(c) It can, because it satisfies the conditions of Eqs. 64-67. 
(d) It can, because it satisfies the conditions of Eqs. 64-67. 
PE: Verify the answer of this Problem by verifying the conditions of Eqs. 64-67 in each case. 
5. Find the constant c in the following probability mass functions: 
(a) $P(k) = c\,2^{k}$ ($k = 1,2,3,4,5$). (b) $P(k) = c/k^{2}$ ($k = 1,2,\ldots,7$). 
(CUP Se2-*: Cees. 6) (d) PR) Sek (S452; 21200): 
Answer: Any probability mass function must satisfy Eq. 67 and this can be used (i.e. by summing 
the function over its entire range and equating the result to 1) to infer the value of c in each case, that is: 


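(a) $\sum_{k=1}^{5} c\,2^{k} = 1 \;\Rightarrow\; 62c = 1 \;\Rightarrow\; c = \frac{1}{62}$. 

(b) $\sum_{k=1}^{7} \frac{c}{k^{2}} = 1 \;\Rightarrow\; \frac{266681}{176400}\,c = 1 \;\Rightarrow\; c = \frac{176400}{266681}$. 
(c) $\sum_{k=1}^{\infty} c\,2^{-k} = 1 \;\Rightarrow\; 1 \times c = 1 \;\Rightarrow\; c = 1$. 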
(dy 3 ee eT > e=1 > c= 4. 


PE: Find the constant c in the following probability mass functions: 
(a) P(k) Se3" (RH 152734; 5,8): (b) P(k) =c 
(2). PRS er 2 8) (d) P(k) =c 


Se Ah S28 A) 
samme (2a be rn cc) 


4.1.1 Uniform Distribution 


The discrete uniform distribution (which is the simplest mass function) is given by: 

$$P(x) = \begin{cases} \dfrac{1}{n} & (x = x_1, x_2, \ldots, x_n) \\ 0 & (\text{otherwise}) \end{cases} \tag{69}$$

where $x$ is a discrete random variable that takes $n$ different values (i.e. $x_1, x_2, \ldots, x_n$). For example, 
selecting a ball randomly from a bag containing exactly $n$ identical balls is subject to this distribution 
because the probability of selecting any specific ball is $1/n$. For the uniform distribution [of the form 
$P(k) = 1/n$ where $k = 1,2,\ldots,n$] the mean $\mu$ and the variance $V$ are given by (see Problem 9 of § 5.1 
and Problem 6 of § 5.2): 

$$\mu = \frac{n+1}{2} \qquad\qquad V = \frac{n^{2}-1}{12} \tag{70}$$
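The mean and variance of Eq. 70 can be checked numerically by summing directly over the mass function $P(k) = 1/n$. The following is only a minimal sketch (it is not one of the codes that accompany the book); the names `Moments` and `uniformMoments` are hypothetical and are introduced here just for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type: holds the mean and variance of a distribution.
struct Moments {
    double mean;
    double variance;
};

// Mean and variance of the discrete uniform distribution P(k) = 1/n
// (k = 1, 2, ..., n), obtained by direct summation over the mass function
// (no closed form is assumed inside this function).
Moments uniformMoments(int n) {
    assert(n >= 1);
    const double P = 1.0 / n;           // probability of each value
    double mean = 0.0;                  // accumulates sum of k * P(k)
    double second = 0.0;                // accumulates sum of k^2 * P(k)
    for (int k = 1; k <= n; ++k) {
        mean += k * P;
        second += static_cast<double>(k) * k * P;
    }
    return {mean, second - mean * mean};  // V = E[k^2] - (E[k])^2
}
```

For a fair die (n = 6) this summation should agree with Eq. 70, i.e. a mean of (6 + 1)/2 = 3.5 and a variance of (36 − 1)/12 = 35/12.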


Problems 


1. Give some examples of processes and experiments whose outcome is subject to the discrete uniform 
distribution. 
Answer: 
• Tossing a fair coin (to get H, T). 
• Rolling a fair die (to get 1, 2, 3, 4, 5, 6). 
• Drawing a card from a deck of cards. 
• Giving birth (to boy or girl). 
• Selecting a season (i.e. spring, summer, autumn, winter) at random in a year. 
• Selecting an integer at random (to get an even or odd number). 
PE: Give more examples of processes and experiments whose outcome is subject to the discrete uniform 
distribution. 
2. Show that the discrete uniform distribution is normalized. 
Answer: From Eq. 69 we have: 

$$\sum_{k=1}^{n} P(x_k) = \sum_{k=1}^{n} \frac{1}{n} = \frac{1}{n} \times n = 1$$


PE: Explain and justify all the details of this derivation. 
4.1.2 Binomial Distribution 


The binomial distribution is probably the most important discrete probability distribution. The mass 
function of this distribution (which is used to model the probability distribution in a series of Bernoulli 
trials) is given by: 

$$P(x = k) = P(n,p,k) = C^{n}_{k}\, p^{k} (1-p)^{n-k} \qquad (k = 0,1,\ldots,n) \tag{71}$$

where $C^{n}_{k}$ is the binomial coefficient, $p$ is the probability of occurrence, $n$ (which is a non-negative integer) 
is the number of trials and $k$ is the number of occurrences ($k = 0,1,\ldots,n$). For example, getting $k$ heads 
in a trial of throwing a coin $n$ times (where the probability of getting head in each throw is $p$) is subject 
to this kind of distribution. For the binomial distribution the mean $\mu$ and the variance $V$ are given by 
(see Problem 9 of § 5.1 and Problem 6 of § 5.2): 

$$\mu = np \qquad\qquad V = np(1-p) \tag{72}$$


It is important to note the following about the binomial distribution: 
• For the binomial distribution to apply, $p$ must be the same in all trials and the outcomes of the trials 
must be independent of each other (i.e. the trials are a Bernoulli process which we defined in the preamble 
of this chapter and indicated at the beginning of this subsection). 
• The “binomial” designation is because only two possible outcomes are considered in this type of 
distribution, i.e. occurrence (or “success”; see next point) with probability $p$ and non-occurrence (or “failure”) 
with probability $(1-p)$. The “binomial” designation is also because the probabilities are given by the 
terms of the binomial theorem (see Problem 20 of § 2.2). 
• It is common in the literature of probability to label one of the outcomes in the binomial distribution 
(which is the occurrence of the event of primary interest in that distribution) as “success” and to label 
the other outcome as “failure”. However, the reader should note that “success” (like “failure”) is a label 
rather than a real success (and hence it could be a disaster like the occurrence of death or explosion) and 
that is why we prefer to use “occurrence” and “non-occurrence” instead of “success” and “failure” (although 
“success” and “failure” are widely used in the literature). 
• The Bernoulli distribution is a special case of the binomial distribution corresponding to $n = 1$. However, 
the reader should be aware that “Bernoulli distribution” may be used to label the binomial distribution 
in general. In fact, the terminology about this issue (as about almost any other issue in science and 
mathematics) is not universal and hence attention is required when reading the literature. 


Problems 
1. Describe the binomial distribution and outline its main features. 
Answer: The binomial distribution is a probabilistic model for processes made of a number of identical 
and independent trials where each trial has only two possible outcomes (which are usually labeled as 
success and failure) and the probability of these outcomes is the same in all trials. The random variable 
in the binomial distribution represents the number of occurrences (or successes) $k$ in a given number 
of trials $n$ (where $k$ can take any one of the integer values $k = 0,1,\ldots,n$). 
PE: Suggest some reasons for why the binomial distribution is one of the most important discrete 
probability distribution functions (and possibly the most important of all). 

2. Give some examples of the binomial distribution. 
Answer: 
• Getting $k$ evens (and $9 - k$ odds) in a trial of throwing a die 9 times (where $k = 0,1,\ldots,9$). 
• Getting $k$ defective items in $n$ items ($0 \le k \le n$) manufactured in a production line whose probability 
of defect is $p$. 
PE: Give more examples for the binomial distribution. 

3. Derive (or rather justify) Eq. 71 using the assumptions (outlined earlier) which the binomial distribution 
is based on. 
Answer: Because the trials are independent, the probability of $k$ successes (each with probability $p$) 
and $n-k$ failures (each with probability $1-p$) in a given order is $p^{k}(1-p)^{n-k}$ (see Eq. 57). Now, the 
number of different orders in which this can occur is given by $C^{n}_{k,n-k} = C^{n}_{k}$ (which is the number of 
permutations with repetitions where $k$ objects are identical and $n-k$ objects are identical; see Problem 
30 of § 2.2). So, according to the addition law of disjoint events (see Eq. 53) the total probability of $k$ 
successes (each with probability $p$) and $n-k$ failures (each with probability $1-p$) is $C^{n}_{k}\,p^{k}(1-p)^{n-k}$ 
which is what is given by Eq. 71. 
PE: Justify the use of the addition law of disjoint events (as given by Eq. 53) in the above argument. 

4. Calculate the following probabilities of the binomial distribution given in the form $P(n,k,p)$: 

(a) P(12, 6, 0.12). (b) P(47, 33, 0.65). (c) P(276, 201, 0.91). 
(d) P(1672, 1569, 0.26). (e) P(3927, 56, 0.99). (f) P(5920, 4935, 0.43). 
Answer: From Eq. 71 we have: 

(a) $P(12,6,0.12) = C^{12}_{6} \times 0.12^{6} \times (1-0.12)^{12-6} \simeq 0.00128131373155$ 

(b) $P(47,33,0.65) = C^{47}_{33} \times 0.65^{33} \times (1-0.65)^{47-33} \simeq 0.0947691317661$ 

(c) $P(276,201,0.91) = C^{276}_{201} \times 0.91^{201} \times (1-0.91)^{276-201} \simeq 1.5350300806 \times 10^{-18}$ 

(d) $P(1672,1569,0.26) = C^{1672}_{1569} \times 0.26^{1569} \times (1-0.26)^{1672-1569} \simeq 1.69204310134 \times 10^{-765}$ 

(e) $P(3927,56,0.99) = C^{3927}_{56} \times 0.99^{56} \times (1-0.99)^{3927-56} \simeq 9.99983922704 \times 10^{-7617}$ 

(f) $P(5920,4935,0.43) = C^{5920}_{4935} \times 0.43^{4935} \times (1-0.43)^{5920-4935} \simeq 1.2159880926 \times 10^{-894}$ 


PE: Repeat the Problem for the following: 
(a) P(59, 11, 0.36). (b) P(183, 77, 0.05). (c) P(592, 18, 0.18). 
(d) P(927, 725, 0.67). (e) P(3380, 2451, 0.22). (f) P(6964, 2369, 0.85). 


5. Show that the binomial distribution is normalized. 
Answer: From Eq. 71 we have: 


$$\sum_{k=0}^{n} P(k) = \sum_{k=0}^{n} C^{n}_{k}\, p^{k} (1-p)^{n-k} = \left[(1-p) + p\right]^{n} = [1]^{n} = 1$$


where we used the binomial theorem (see Eq. 33) in the second step. 
PE: Explain and justify all the details of this derivation. 

6. Show that the probability of having at least one success in a binomial distribution is $1 - (1-p)^{n}$. 
Answer: “Having at least one success” is equivalent to $k > 0$. Hence: 

$$P(k > 0) = P(k = 0,1,\ldots,n) - P(k = 0) = 1 - P(k = 0) = 1 - C^{n}_{0}\, p^{0} (1-p)^{n-0} = 1 - (1-p)^{n}$$
PE: Do the following: 

(a) Justify each step of the above derivation. 

(b) Show that in a binomial distribution $P(n,p,k)$ the number of trials $n$ required to have at least one 
success with a probability greater than or equal to $P$ ($0 < P < 1$) is: 

$$n \ge \frac{\ln(1-P)}{\ln(1-p)}$$
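The inequality of part (b) can be used directly to find the smallest number of trials. The following is a minimal sketch under the assumption $0 < p < 1$ and $0 < P < 1$; the function names `minTrialsForSuccess` and `atLeastOne` are hypothetical and are not taken from the codes of the book. Note that $\ln(1-p)$ is negative, which is why dividing by it reverses the inequality $(1-p)^{n} \le 1-P$ into the "greater than or equal" form above.

```cpp
#include <cassert>
#include <cmath>

// Probability of at least one success in n Bernoulli trials: 1 - (1 - p)^n.
double atLeastOne(int n, double p) {
    assert(n >= 0 && p >= 0.0 && p <= 1.0);
    return 1.0 - std::pow(1.0 - p, n);
}

// Smallest integer n satisfying n >= ln(1 - P) / ln(1 - p), i.e. the minimum
// number of trials for which the probability of at least one success is >= P.
// If the ratio is (nearly) an exact integer, rounding error in the logarithms
// may shift the result by one, so the result is worth checking with atLeastOne.
int minTrialsForSuccess(double p, double P) {
    assert(p > 0.0 && p < 1.0);
    assert(P > 0.0 && P < 1.0);
    return static_cast<int>(std::ceil(std::log(1.0 - P) / std::log(1.0 - p)));
}
```

For example, with p = 1/6 (getting a given face of a fair die) and P = 0.5 the ratio is about 3.8, so 4 throws are needed; indeed 1 − (5/6)^3 is below 0.5 while 1 − (5/6)^4 is above it.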
7. Investigate some of the limiting cases of the binomial distribution which may look dubious or 
problematic. 
Answer: We investigate some of these limiting cases in the following points: 
• $n = 0$ with $p \ne 0$ and $p \ne 1$: in this case the above formula (i.e. Eq. 71) is valid because: 

$$P(0) = C^{0}_{0}\, p^{0} (1-p)^{0-0} = 1 \times 1 \times 1 = 1$$

which is correct because the probability of 0 occurrence in 0 number of trials is certainty. 
• $n = 1$ with $p \ne 0$ and $p \ne 1$: in this case the above formula (i.e. Eq. 71) is valid because: 

$$\begin{aligned} P(0) &= C^{1}_{0}\, p^{0} (1-p)^{1-0} = 1 \times 1 \times (1-p)^{1} = 1 \times 1 \times (1-p) = 1-p \\ P(1) &= C^{1}_{1}\, p^{1} (1-p)^{1-1} = 1 \times p \times (1-p)^{0} = 1 \times p \times 1 = p \end{aligned}$$

which is correct because the first is the probability of non-occurrence and the second is the probability 
of occurrence. 
• $p = 0$: in this case the above formula (i.e. Eq. 71) becomes problematic for $k = 0$ because: 

$$P(0) = C^{n}_{0}\, p^{0} (1-p)^{n-0} = 1 \times 0^{0} \times 1^{n} \qquad (n = 0,1,\ldots)$$

So, to salvage this formula in this case we need to adopt a convention that $0^{0} = 1$. However, the 
formula should be OK for $k \ne 0$ including $k = n$ since in all these cases we have: 

$$P(k) = C^{n}_{k}\, p^{k} (1-p)^{n-k} = C^{n}_{k} \times 0^{k} \times 1^{n-k} = 0 \qquad (k = 1,2,\ldots,n)$$

which is correct because if $p = 0$ then the occurrence $k$ times is impossible (noting that $k > 0$ which 
means that it does occur sometimes in contradiction with $p = 0$). 
• $p = 1$: in this case the above formula (i.e. Eq. 71) becomes problematic for $k = n$ because: 

$$P(n) = C^{n}_{n}\, p^{n} (1-p)^{n-n} = 1 \times 1^{n} \times 0^{0} \qquad (n = 0,1,\ldots)$$

So, to salvage this formula in this case we again need to adopt a convention that $0^{0} = 1$. However, the 
formula should be OK for $k \ne n$ including $k = 0$ since in all these cases we have: 

$$P(k) = C^{n}_{k}\, p^{k} (1-p)^{n-k} = C^{n}_{k} \times 1^{k} \times 0^{n-k} = 0 \qquad (k = 0,1,2,\ldots,n-1)$$

which is correct because if $p = 1$ then the occurrence $k$ times is impossible (noting that $k < n$ which 
means that it does not occur sometimes in contradiction with $p = 1$). 
PE: Investigate and analyze other potential limiting cases. 
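The limiting cases discussed in this Problem can be tried out in code. The sketch below is an illustration only (the names `binomialCoefficient` and `binomialPmf` are hypothetical and are not the codes of the book). It relies on the fact that the C and C++ standard `pow` function returns 1 when the exponent is zero, even for a zero base (on implementations following IEC 60559 / Annex F of the C standard), so the convention $0^{0} = 1$ is already built in.

```cpp
#include <cassert>
#include <cmath>

// Binomial coefficient C(n, k) as a double, via the multiplicative formula.
// Suitable only for moderate n (no overflow protection is attempted).
double binomialCoefficient(int n, int k) {
    assert(n >= 0 && k >= 0 && k <= n);
    double c = 1.0;
    for (int i = 1; i <= k; ++i) {
        c = c * (n - k + i) / i;
    }
    return c;
}

// Direct evaluation of Eq. 71, including the limiting cases p = 0 and p = 1.
double binomialPmf(int n, int k, double p) {
    assert(p >= 0.0 && p <= 1.0);
    // std::pow(0.0, 0) evaluates to 1, which matches the convention 0^0 = 1.
    return binomialCoefficient(n, k) * std::pow(p, k) * std::pow(1.0 - p, n - k);
}
```

With this sketch, `binomialPmf(5, 0, 0.0)` and `binomialPmf(5, 5, 1.0)` both evaluate to 1, while `binomialPmf(5, 2, 0.0)` and `binomialPmf(5, 3, 1.0)` both evaluate to 0, in agreement with the analysis above.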

8. Investigate the variation of the shape of (the curve[89] representing) the binomial distribution with the 
variation of $p$ for a given $n$ (say $n = 100$) by plotting the binomial distribution with various values of 
$p$ (say $p = 0.1, 0.3, 0.5, 0.7, 0.9$) on the same graph. 
Answer: See Figure 16. As we see, as $p$ increases the peak of the curve shifts to the right (since 
$\mu = np$ increases with increasing $p$). Regarding the height of the peak we note that it decreases with 
increasing $p$ until we reach $p = 0.5$ and the curve steadily flattens with this increase of $p$ (since the area 
under the curve should remain constant due to normalization), but this trend is reversed after $p = 0.5$. 


[89] The use of “curve” (as well as other terms and expressions which are more appropriate for continuous distributions) in 
the case of binomial and other discrete distributions is for the sake of simplicity and clarity. 
Figure 16: The plot of Problem 8 of § 4.1.2 (noting that n = 100 for all these curves). For clarity, we use 
solid curves instead of discrete points. Each number on the top (i.e. p = 0.1, 0.3, 0.5, 0.7, 0.9) belongs to 
the profile beneath it. 


The reason for this behavior is that the variance of the binomial distribution (see Eq. 72 as well as § 
5.2) is $V = np(1-p)$ which varies in this way, i.e. it increases up to $p = 0.5$ and then decreases after 
$p = 0.5$.[90] 

PE: Repeat this Problem for n = 200 and p = 0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95. 

9. Investigate the variation of the shape of (the curve representing) the binomial distribution with the 
variation of $n$ for a given $p$ (say $p = 0.5$) by plotting the binomial distribution with various values of 
$n$ on the same graph. 

Answer: Noting that for the binomial distribution we have $\mu = np$ and $V = np(1-p)$, it should be 
obvious that increasing $n$ results in shifting the peak of (the curve representing) the distribution to the 
right and flattening (the curve representing) the distribution. In Figure 17 we plotted the binomial 
distribution for a number of $n$’s (i.e. $n = 20, 60, 100, 140, 180$). As we see, as $n$ increases the peak of 
the curve shifts to the right and the curve becomes flatter, with its height decreasing and its width 
broadening. 
PE: Repeat this Problem for other p’s (e.g. p = 0.3). 

10. Find the probability of: 
(a) Getting the number 6 three times (exactly) in a series of trials of throwing a fair die 7 times. 
(b) Getting 10 defective items (exactly) in $10^{5}$ items manufactured in a production line whose 
probability of defect is 0.0003. 
Answer: These are obviously instances of the binomial distribution. 
(a) Using Eq. 71 with $n = 7$, $k = 3$ and $p = 1/6$, we have: 

$$P(3) = C^{7}_{3} \left(\frac{1}{6}\right)^{3} \left(\frac{5}{6}\right)^{4} \simeq 0.0781$$

(b) Using Eq. 71 with $n = 10^{5}$, $k = 10$ and $p = 0.0003$, we have: 

$$P(10) = C^{10^{5}}_{10}\, (0.0003)^{10}\, (0.9997)^{99990} \simeq 0.0000152$$

[90] We note that at p = 0.5 we have $dV/dp = n(1-2p) = 0$ and $d^{2}V/dp^{2} = -2n < 0$ which means that V has a peak at 
p = 0.5. 

Figure 17: The plot of Problem 9 of § 4.1.2 (noting that p = 0.5 for all these curves). For clarity, we use 
solid curves instead of discrete points. Each number on the top (i.e. n = 20, 60, 100, 140, 180) belongs to 
the profile beneath it. 


PE: Find the probability of: 

(a) Getting a tail 2 times in a series of trials of throwing a fair coin 5 times. 

(b) Getting zero defective items in 1000 items manufactured in a production line whose probability of 
defect is 0.007. 

11. Make a 3D plot for the binomial distribution with $p = 0.5$ and $n = 5,6,\ldots,15$ (i.e. $P$ as a function of 
$n$ and $k$). 

Answer: See Figure 18. 

PE: Repeat the Problem with p = 0.6. 

12. Write a simple program that calculates the probabilities of the binomial distribution. 

Answer: See the BinomialProbability.cpp code which calculates the individual probabilities of the 
binomial distribution. 

Note: to do extensive calculations of the binomial distribution (i.e. on a given range of $k$) 
conveniently, we wrote another code (see the BinomialDistribution.cpp code) in which we put the core of 
the BinomialProbability.cpp code into a $k$ loop and output the results to a file. This code is especially 
useful for doing extensive calculations in extreme cases (such as those cases involving very large and/or 
very small numbers). 

PE: Comment the BinomialProbability.cpp code explaining what each line is supposed to do. 
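The BinomialProbability.cpp code itself is not reproduced here, so the following is only an independent minimal sketch of one common way to evaluate Eq. 71 in extreme cases, and it should not be taken as the method of that code. The idea (an assumption of this sketch) is to work with logarithms, using the standard `std::lgamma` function for $\ln(n!) = \ln\Gamma(n+1)$, and to return the base-10 logarithm of the probability so that values far below the smallest representable `double` (such as those of parts d, e and f of Problem 4) can still be expressed. The name `log10BinomialPmf` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Base-10 logarithm of the binomial probability C(n,k) p^k (1-p)^(n-k)
// of Eq. 71, for 0 < p < 1. Working in log space avoids overflow in the
// binomial coefficient and underflow in the powers.
double log10BinomialPmf(int n, int k, double p) {
    assert(n >= 0 && k >= 0 && k <= n);
    assert(p > 0.0 && p < 1.0);
    // ln C(n,k) = ln n! - ln k! - ln (n-k)!, with ln m! = lgamma(m + 1).
    const double lnC = std::lgamma(n + 1.0) - std::lgamma(k + 1.0)
                     - std::lgamma(n - k + 1.0);
    // log1p(-p) computes ln(1 - p) accurately even when p is small.
    const double lnP = lnC + k * std::log(p) + (n - k) * std::log1p(-p);
    return lnP / std::log(10.0);
}
```

If `x` is the returned value, then the probability is $10^{x}$; its exponent is `std::floor(x)` and its mantissa is `std::pow(10.0, x - std::floor(x))`. For moderate cases such as part (a) of Problem 4, `std::pow(10.0, x)` gives the probability directly.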

13. Obtain a formula for the binomial probability $P(k+1)$ in terms of the binomial probability $P(k)$ and 
suggest useful applications and advantages of this formula. 

Answer: From Eq. 71 we have (noting that $1 \le k+1 \le n$): 

$$\begin{aligned} P(k+1) &= C^{n}_{k+1}\, p^{k+1} (1-p)^{n-k-1} = \frac{n!}{(k+1)!\,(n-k-1)!}\, p^{k+1} (1-p)^{n-k-1} \\ &= \frac{p(n-k)}{(1-p)(k+1)} \, \frac{n!}{k!\,(n-k)!}\, p^{k} (1-p)^{n-k} = \frac{p(n-k)}{(1-p)(k+1)}\, P(k) \end{aligned}$$

Figure 18: The plot of Problem 11 of § 4.1.2. 
This recurrence formula is used to calculate successive binomial probabilities more easily and rapidly 
since this formula uses the previously calculated probability in the calculation of a given probability and 
hence it is more economical in terms of time and computing resources. Regarding the useful applications 
and advantages we can say (for example): 

• The formula can alleviate some of the difficulties in the calculations of binomial 
probabilities involving large numbers by avoiding overflow in the calculations since these extreme 
calculations can be done in stages and hence the large numbers are balanced by scaling them down at 
each stage by multiplying them by tiny numbers which keeps the calculations manageable and under 
control (i.e. they do not exceed the limits imposed by the calculation resources, which would lead 
to overflow). 

• This formula can be especially important in the case of calculating cumulative binomial probabilities 
(see § 4.3) involving large numbers (e.g. n = 10°) where the calculation of a long series of probabilities 
is required and hence it can reduce the time and computing resources needed to do these calculations. 
• This formula can also be useful in analytical derivations and arguments where it could lead to 
simplifications or cancellations for instance. 

PE: Do the following: 

(a) Construct a spreadsheet (or write a computer code) in which you use this formula to do some 
relatively lengthy binomial distributions (e.g. with n = 150 and p = 0.65). 

(b) Suggest other useful applications and advantages of this formula. 
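The recurrence of this Problem, i.e. $P(k+1) = \frac{p(n-k)}{(1-p)(k+1)} P(k)$, can be sketched in code as follows. This is a minimal illustration under the assumption $0 < p < 1$ (the name `binomialDistribution` is hypothetical and this is not the BinomialDistribution.cpp code of the book). It starts from $P(0) = (1-p)^{n}$ and fills the whole distribution without evaluating any factorial.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns the probabilities P(0), P(1), ..., P(n) of the binomial
// distribution using the recurrence P(k+1) = p(n-k) / ((1-p)(k+1)) * P(k).
// Caveat: for very large n the starting value (1-p)^n may underflow to zero,
// in which case a log-space variant or a different starting point is needed.
std::vector<double> binomialDistribution(int n, double p) {
    assert(n >= 0);
    assert(p > 0.0 && p < 1.0);
    std::vector<double> P(n + 1);
    P[0] = std::pow(1.0 - p, n);  // P(0) = (1-p)^n
    for (int k = 0; k < n; ++k) {
        P[k + 1] = P[k] * p * (n - k) / ((1.0 - p) * (k + 1));
    }
    return P;
}
```

For instance, `binomialDistribution(7, 1.0 / 6.0)[3]` should reproduce the value of part (a) of Problem 10 (about 0.0781), and the entries returned for n = 150 and p = 0.65 should sum to 1 within rounding error.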
4.1.3 Multinomial Distribution 


When we have a discrete random variable that has $k$ distinct and independent outcomes $x_i$ each with 
a probability $p_i$ (where $i = 1,2,\ldots,k$ and $\sum_{i=1}^{k} p_i = 1$) and we make $n$ trials involving this random 
variable, then the probability $P(n_1,n_2,\ldots,n_k;p_1,p_2,\ldots,p_k)$ that in $n_i$ of these $n$ trials we get the $x_i$ 
outcome (where $\sum_{i=1}^{k} n_i = n$) is modeled by the multinomial distribution. In fact, the multinomial 
distribution was introduced and investigated briefly (without mentioning its name) in Problem 2 of § 4.1 
(where it is used as an introductory example or a case study for the probability mass functions). 
So, from the result of Problem 2 of § 4.1 the probability mass function of this distribution is given by: 

$$P(n_1,n_2,\ldots,n_k;p_1,p_2,\ldots,p_k) = C^{n}_{n_1,n_2,\ldots,n_k}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!} \prod_{j=1}^{k} p_j^{n_j} = n! \prod_{j=1}^{k} \frac{p_j^{n_j}}{n_j!} \tag{73}$$

where $n = n_1 + n_2 + \cdots + n_k$ and $p_1 + p_2 + \cdots + p_k = 1$. For the multinomial distribution the mean $\mu$ 
and the variance $V$ are given by (see Problem 9 of § 5.1 and Problem 6 of § 5.2): 

$$\mu(x_i) = n p_i \qquad\qquad V(x_i) = n p_i (1 - p_i) \tag{74}$$

where $i = 1,2,\ldots,k$. 

It is important to note the following about the multinomial distribution: 
• The “multinomial” label comes from the fact that the distribution is represented mathematically by the 
expansion of the multinomial theorem (see Problem 28 of § 2.2 and Eq. 34 in particular). 
• The binomial distribution (see § 4.1.2) is a special case of the multinomial distribution[91] corresponding 
to $k = 2$ (and hence $n_1 + n_2 = n$ and $p_1 + p_2 = 1$). Now, if we note that the Bernoulli distribution is a 
special case of the binomial distribution (see the notes of § 4.1.2), then the Bernoulli distribution can also 
be seen as a special case of the multinomial distribution. 
• Referring to Eq. 74, we have $k$ means and $k$ variances corresponding to the $k$ possible outcomes, i.e. 
each outcome has a mean and a variance. 
• The multinomial distribution can be considered and treated as a multivariate distribution (see § 4.4). 
However, in this book we avoid this approach and hence we mostly treat (and label) the $k$ distinct and 
independent possibilities (i.e. $x_i$) as outcomes[92] of a discrete random variable $x$ which has $k$ distinct and 
independent outcomes $x_i$ rather than being different random variables (although they can be similarly 
treated as different random variables). Our motivation for avoiding the multivariate approach is to avoid 
going through some details and complexities of the subject of multivariate distributions which is out of 
scope (although we will introduce this subject briefly in § 4.4). 


Problems 


1. Give some examples of probabilities that are subject to the multinomial distribution. 
Answer: For instance: 
• The probability of getting “1” once, “2” twice, and “3” five times in a trial of throwing 8 dice. 
• In a presidential election we have 5 candidates: the 1st candidate got 41% of the vote, the 2nd 
27%, the 3rd 15%, the 4th 11%, and the 5th 6%. If 10 voters are selected randomly, the probability 
that exactly 2 voters of these 10 voters have voted for each candidate is subject to the multinomial 
distribution.[93] 
PE: Give more examples of probabilities that follow the model of multinomial distribution. 

2. In a game of gambling, the player throws 3 fair dice simultaneously and he wins if the sum is not less 
than 15. Find the probability of winning. 


[91] Or alternatively, the multinomial distribution is a generalization of the binomial distribution where each trial has more 
than two possible outcomes (instead of the two possible outcomes assumed in the binomial distribution) with 
corresponding probabilities whose sum is 1. 

[92] “Outcomes” here should not be understood as single (or individual) outcomes but rather as types (or categories) of 
outcomes. 

[93] In fact, we should assume that the number of voters is very large (which is very realistic in a presidential election). 
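Answer: To win, he should get (3 sixes) or (2 sixes and 1 five) or (2 sixes and 1 four) or (2 sixes and 
1 three) or (1 six and 2 fives) or (1 six and 1 five and 1 four) or (3 fives). From Eq. 73 (where the 
subscripts of $C$ are the numbers of times of the faces 1, 2, ..., 6 respectively) we get: 

$$\begin{aligned} P(\text{3 sixes}) &= C^{3}_{0,0,0,0,0,3} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{3} = \left(\tfrac{1}{6}\right)^{3} \\ P(\text{2 sixes, 1 five}) &= C^{3}_{0,0,0,0,1,2} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{1} \left(\tfrac{1}{6}\right)^{2} = 3 \left(\tfrac{1}{6}\right)^{3} \\ P(\text{2 sixes, 1 four}) &= C^{3}_{0,0,0,1,0,2} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{1} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{2} = 3 \left(\tfrac{1}{6}\right)^{3} \\ P(\text{2 sixes, 1 three}) &= C^{3}_{0,0,1,0,0,2} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{1} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{2} = 3 \left(\tfrac{1}{6}\right)^{3} \\ P(\text{1 six, 2 fives}) &= C^{3}_{0,0,0,0,2,1} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{2} \left(\tfrac{1}{6}\right)^{1} = 3 \left(\tfrac{1}{6}\right)^{3} \\ P(\text{1 six, 1 five, 1 four}) &= C^{3}_{0,0,0,1,1,1} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{1} \left(\tfrac{1}{6}\right)^{1} \left(\tfrac{1}{6}\right)^{1} = 6 \left(\tfrac{1}{6}\right)^{3} \\ P(\text{3 fives}) &= C^{3}_{0,0,0,0,3,0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{0} \left(\tfrac{1}{6}\right)^{3} \left(\tfrac{1}{6}\right)^{0} = \left(\tfrac{1}{6}\right)^{3} \end{aligned}$$

Now, these events are pairwise exclusive and hence from Eq. 53 we get: 

$$\text{Probability of win} = 2 \left(\frac{1}{6}\right)^{3} + 4 \times 3 \left(\frac{1}{6}\right)^{3} + 6 \left(\frac{1}{6}\right)^{3} = 20 \left(\frac{1}{6}\right)^{3} \simeq 0.09259$$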


PE: Do you notice anything odd about the relation between $n$ and $k$ in the multinomial distribution 
formula (i.e. Eq. 73) which we used in the solution of this Problem? 
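The result of this Problem (i.e. 20 favorable cases out of $6^{3} = 216$ equally likely ordered cases) can be cross-checked by brute-force enumeration instead of the multinomial formula. The following is a minimal sketch; the function name `countOutcomesWithSumAtLeast` is hypothetical and introduced only for this illustration.

```cpp
#include <cassert>

// Counts the ordered outcomes (a, b, c) of throwing three dice for which
// the sum a + b + c is at least minSum. Every ordered outcome has the
// same probability (1/6)^3, so the probability is this count over 216.
int countOutcomesWithSumAtLeast(int minSum) {
    int count = 0;
    for (int a = 1; a <= 6; ++a) {
        for (int b = 1; b <= 6; ++b) {
            for (int c = 1; c <= 6; ++c) {
                if (a + b + c >= minSum) {
                    ++count;
                }
            }
        }
    }
    return count;
}
```

Calling `countOutcomesWithSumAtLeast(15)` gives 20, so the probability of winning is 20/216 ≃ 0.09259 in agreement with the answer above.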

3. Verify that the probability in the example of Problem 2 is normalized (i.e. the sum of the probabilities 
of all the possible outcomes is unity). 

Answer: When we throw 3 fair dice simultaneously we have three main possibilities: 

• All dice show different faces. The number of cases for this possibility is $P^{6}_{3} = 120$ (because we are 
choosing 3 different faces out of 6). Now, the probability of each case is $(1/6)^{3}$ and the cases are 
pairwise exclusive, and hence the total probability of this possibility is $120 \times (1/6)^{3}$. 

• Only two dice show different faces. The number of cases for this possibility is $P^{6}_{2} \times C^{3}_{2} = 30 \times 3 = 90$ 
(because we are choosing 2 types of different faces out of 6 which is $P^{6}_{2} = 30$; moreover one type is 
repetitive and hence we have $C^{3}_{2} = 3$ options for choosing the 2 repetitive faces out of the 3 faces).[94] 
Now, the probability of each case is $(1/6)^{3}$ and the cases are pairwise exclusive, and hence the total 
probability of this possibility is $90 \times (1/6)^{3}$. 

• All dice show identical faces. The number of cases for this possibility is obviously 6 (because the 
faces could be 1 or 2 or 3 or 4 or 5 or 6). Now, the probability of each case is $(1/6)^{3}$ and the cases are 
pairwise exclusive, and hence the total probability of this possibility is $6 \times (1/6)^{3}$. 

On summing the probabilities of these three exhaustive and disjoint possibilities we get: 

$$120 \times \left(\frac{1}{6}\right)^{3} + 90 \times \left(\frac{1}{6}\right)^{3} + 6 \times \left(\frac{1}{6}\right)^{3} = \frac{216}{216} = 1$$


PE: Justify the normalization shown in this Problem using a multinomial approach. 
[94] Alternatively, we have 3 possibilities for the face that is different from the other two faces. 

4. Calculate the following probabilities of the multinomial distribution given in the form: 

$$P^{\,n_1,n_2,\ldots,n_k}_{\,p_1,p_2,\ldots,p_k} = C^{n}_{n_1,n_2,\ldots,n_k}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$$

where $n = n_1 + n_2 + \cdots + n_k$ and $p_1 + p_2 + \cdots + p_k = 1$: 

(a) $P^{\,6,9,15}_{\,0.34,0.39,0.27}$. (b) $P^{\,12,0,21,2,15}_{\,0.11,0.19,0.35,0.25,0.10}$. (c) $P^{\,8,2,65,1,33,99}_{\,0.02,0.26,0.12,0.36,0.08,0.16}$. 

(d) $P^{\,3,19,17,5,6,108}_{\,0.2,0.1,0.3,0.07,0.1,0.23}$. (e) $P^{\,14,53,19,51,66,79}_{\,0.11,0.2,0.18,0.06,0.43,0.02}$. (f) $P^{\,9,83,78,197,150}_{\,0.13,0.09,0.69,0.06,0.03}$. 

Answer: From Eq. 73 we have: 

(a) $P^{\,6,9,15}_{\,0.34,0.39,0.27} = C^{30}_{6,9,15}\; 0.34^{6}\; 0.39^{9}\; 0.27^{15} \simeq 0.000739565021175$ 

(b) $P^{\,12,0,21,2,15}_{\,0.11,0.19,0.35,0.25,0.10} = C^{50}_{12,0,21,2,15}\; 0.11^{12}\; 0.19^{0}\; 0.35^{21}\; 0.25^{2}\; 0.10^{15} \simeq 2.48247886983 \times 10^{-14}$ 

(c) $P^{\,8,2,65,1,33,99}_{\,0.02,0.26,0.12,0.36,0.08,0.16} = C^{208}_{8,2,65,1,33,99}\; 0.02^{8}\; 0.26^{2}\; 0.12^{65}\; 0.36^{1}\; 0.08^{33}\; 0.16^{99} \simeq 3.99728329044 \times 10^{-86}$ 

(d) $P^{\,3,19,17,5,6,108}_{\,0.2,0.1,0.3,0.07,0.1,0.23} = C^{158}_{3,19,17,5,6,108}\; 0.2^{3}\; 0.1^{19}\; 0.3^{17}\; 0.07^{5}\; 0.1^{6}\; 0.23^{108} \simeq 1.26258419896 \times 10^{-42}$ 

(e) $P^{\,14,53,19,51,66,79}_{\,0.11,0.2,0.18,0.06,0.43,0.02} = C^{282}_{14,53,19,51,66,79}\; 0.11^{14}\; 0.2^{53}\; 0.18^{19}\; 0.06^{51}\; 0.43^{66}\; 0.02^{79} \simeq 1.77512819128 \times 10^{-89}$ 

(f) $P^{\,9,83,78,197,150}_{\,0.13,0.09,0.69,0.06,0.03} = C^{517}_{9,83,78,197,150}\; 0.13^{9}\; 0.09^{83}\; 0.69^{78}\; 0.06^{197}\; 0.03^{150} \simeq 4.51844246930 \times 10^{-273}$ 


PE: Repeat the Problem for the following: 


335 159,1;6 150,0,23,33,12,8 71,22,13,36,81,3 
(a) P.07,6.24,0.49,0.15,0.05° (b) P.23,0.15,0.16,0.14,0.13,0.19" (c) P5.33,0.08,0.05,0.26,0.15,0.13° 


5. Show that the multinomial distribution is normalized. 
Answer: From Eq. 73 we have: 

$$\sum_{n_1+n_2+\cdots+n_k=n} P(n_1,n_2,\ldots,n_k;p_1,p_2,\ldots,p_k) = \sum_{n_1+n_2+\cdots+n_k=n} C^{n}_{n_1,n_2,\ldots,n_k}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k} = (p_1 + p_2 + \cdots + p_k)^{n} = (1)^{n} = 1$$

where we used the multinomial theorem (see Eq. 34) in the second step and used the fact that 
$\sum_{i=1}^{k} p_i = 1$ in the third step. 
PE: Explain and justify (verbally) all the details of this derivation. 
6. Find the probability for: 
(a) The first example of Problem 1. (b) The second example of Problem 1. 


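Answer: We use in this answer the form $P^{\,n_1,n_2,\ldots,n_k}_{\,p_1,p_2,\ldots,p_k}$. 
(a) Using Eq. 73 with $k = 6$ (corresponding to the 6 possible outcomes of each die), $n = 8$, $n_1 = 1$, 
$n_2 = 2$, $n_3 = 5$, $n_4 = n_5 = n_6 = 0$, $p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = 1/6$, we have: 

$$P^{\,1,2,5,0,0,0}_{\,1/6,1/6,1/6,1/6,1/6,1/6} = C^{8}_{1,2,5,0,0,0} \times \left(\frac{1}{6}\right)^{1} \times \left(\frac{1}{6}\right)^{2} \times \left(\frac{1}{6}\right)^{5} \times \left(\frac{1}{6}\right)^{0} \times \left(\frac{1}{6}\right)^{0} \times \left(\frac{1}{6}\right)^{0} \simeq 0.0001$$

(b) Using Eq. 73 with $k = 5$ (corresponding to the 5 candidates), $n = 10$, $n_1 = n_2 = n_3 = n_4 = n_5 = 2$, 
$p_1 = 0.41$, $p_2 = 0.27$, $p_3 = 0.15$, $p_4 = 0.11$, $p_5 = 0.06$, we have: 

$$P^{\,2,2,2,2,2}_{\,0.41,0.27,0.15,0.11,0.06} = C^{10}_{2,2,2,2,2} \times 0.41^{2} \times 0.27^{2} \times 0.15^{2} \times 0.11^{2} \times 0.06^{2} \simeq 0.00136$$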


PE: Repeat: 


(a) Part (a) of this Problem for Poe et ea waseaie: 


(b) Patt. (b) of this Problem for Py 7/437 ois. ouiiol00: 
7. Write a simple program that calculates the probabilities of the multinomial distribution. 
Answer: See the MultinomialProbability.cpp code which calculates the individual probabilities of the 
multinomial distribution. 
PE: Describe the method which the MultinomialProbability.cpp code uses to calculate the probabilities 
of the multinomial distribution. 
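As the MultinomialProbability.cpp code is not shown here, the following is only an independent minimal sketch of one possible way to evaluate Eq. 73, and it should not be read as a description of that code. It is assumed here that the natural logarithm of the probability is accumulated term by term (using `std::lgamma` for the logarithms of the factorials) so that cases with very small probabilities, like those of Problem 4, do not underflow. The name `logMultinomialPmf` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Natural logarithm of the multinomial probability of Eq. 73:
//   ln P = ln n! + sum_j [ n_j ln p_j - ln n_j! ]   with n = sum_j n_j.
// counts holds n_1..n_k and probs holds p_1..p_k (assumed to sum to 1).
double logMultinomialPmf(const std::vector<int>& counts,
                         const std::vector<double>& probs) {
    assert(counts.size() == probs.size());
    int n = 0;
    double lnP = 0.0;
    for (std::size_t j = 0; j < counts.size(); ++j) {
        assert(counts[j] >= 0 && probs[j] >= 0.0);
        n += counts[j];
        lnP -= std::lgamma(counts[j] + 1.0);       // - ln n_j!
        if (counts[j] > 0) {                       // p_j^0 = 1 contributes 0
            lnP += counts[j] * std::log(probs[j]);
        }
    }
    return lnP + std::lgamma(n + 1.0);             // + ln n!
}
```

The probability itself is `std::exp(logMultinomialPmf(counts, probs))` when it is large enough to be represented; otherwise dividing the returned value by `std::log(10.0)` gives the base-10 logarithm from which the exponent and mantissa can be read. For the two parts of Problem 6 this should give about 0.0001 and 0.00136.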


4.1.4 Poisson Distribution 


The Poisson distribution is given by: 

$$P(x = k) = P(\lambda, k) = \frac{\lambda^{k} e^{-\lambda}}{k!} \qquad (k = 0,1,2,\ldots) \tag{75}$$

where $\lambda > 0$ is the Poisson parameter and $k$ is the number of occurrences. The Poisson distribution is 
commonly used to model the probability of an event occurring a given number of times $k$ within a given 
time period (and/or a given spatial region). For example, having $k$ decays in a minute by a given sample 
of radioactive material is subject to this kind of distribution. For the Poisson distribution the mean $\mu$ 
and the variance $V$ are equal to $\lambda$, that is (see Problem 9 of § 5.1 and Problem 6 of § 5.2): 

$$\mu = V = \lambda \tag{76}$$


It is important to note the following points about the Poisson distribution: 
• The Poisson distribution is based on the assumption that the probability of occurrence is constant 
throughout the process and the events are independent of each other. In fact, there are other assumptions 
related to the size of event rate and the total number of events. 
• The Poisson distribution can be seen as a limiting case for the binomial distribution when the number 
of trials $n$ becomes large ($n \to \infty$) and the probability of occurrence $p$ becomes small ($p \to 0$) such that 
$\lambda = np$ stays finite and constant (see Problem 2). Therefore, the Poisson distribution can be used (and 
is used) as a substitute for the binomial distribution in this case (where the relative ease of calculating 
the Poisson compared to calculating the binomial makes the shift from the binomial to the Poisson 
advantageous). 
• The specifications and conditions given in the previous point are rather generic and loose and they 
require more explanation and details (noting for instance that other factors like $k$ and $np$ affect the 
quality of the approximation of binomial by Poisson). So, a case-by-case assessment is required (or at 
least recommended) before using the Poisson distribution (primarily or as an approximation) to model 
a given probability problem. However, these details are not important to us and hence we ignore them 
(noting that these issues are not treated fairly and properly in the literature). 
• Referring to the previous points, an excuse that is usually presented in the literature for shifting from 
the binomial to the Poisson (as an approximation to the binomial) is that calculating $C^{n}_{k}$ in the binomial 
formula could become difficult when $n$ is large. However, we note that the computers (associated with 
mathematical software packages and programming languages) these days usually avoid this difficulty. 
Nevertheless, the Poisson formula involves fewer (and usually smaller) factorials than the binomial formula 
and hence it offers more efficiency in computation and less difficulty in calculation and these factors are 
generally advantageous especially in the cases of extreme calculations which involve large numbers or/and 
many cases (e.g. calculating probabilities involving factorials of integers of order 104 for 10° times where 
computation time becomes significant even on modern computers). 


Problems 


1. Give some examples of probabilities modeled by the Poisson distribution. 
Answer: 
• Getting a given number of visitors to a clinic during a day.
• Getting a given number of visitors to a website during a week.
• Having a given number of infections by coronavirus in a city during 2024.
• Having a given number of decays in a given time period by a given sample of uranium.
• Observing a given number of spiral galaxies in a certain zone of the sky dome.
PE: Give more examples of the Poisson distribution. 

2. What is the most distinctive feature that distinguishes the Poisson distribution from the binomial 
distribution (noting that both are discrete)? 
Answer: It is the fact that while the number of trials n in the binomial distribution is a given finite 
constant, the number of trials in the Poisson distribution is not identified or limited (and hence in the 
binomial distribution we have k = 0,1,...,n while in the Poisson distribution we have k = 0,1,2,...). 
In fact, this is related to what we stated in the second point in the preamble of this subsection. 
PE: According to the literature, the Poisson distribution represents probabilistic situations in which
discrete events occur independently in a continuum at a rate of $\lambda$. Try to justify this.

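3. Justify Eq. 75 (assuming that Poisson is a limit for the binomial when $n \to \infty$ and $p \to 0$ such that
$\lambda = np$ stays finite and constant).
Answer: Based on the given assumption, we start from Eq. 71, that is:

$$P(k) = C^n_k\, p^k (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} \qquad (k = 0, 1, \ldots, n) \tag{77}$$

Now, if we assume that $k \ll n$ (noting that $n \to \infty$) then we have:

$$\frac{n!}{(n-k)!} = n(n-1)\cdots(n-k+1) \simeq n^k \tag{78}$$

Moreover, if we note that $p \to 0$ then we have (noting that $p = \lambda/n$):

$$(1-p)^{n-k} \simeq (1-p)^n = \left(1 - \frac{\lambda}{n}\right)^n \simeq e^{-\lambda} \tag{79}$$

where we used the identity $e^x = \lim_{n\to\infty} \left(1 + \frac{x}{n}\right)^n$ in the third step. On substituting from Eqs. 78
and 79 into Eq. 77 we get (noting that $p = \lambda/n$):

$$P(k) \simeq \frac{n^k}{k!} \left(\frac{\lambda}{n}\right)^k e^{-\lambda} = \frac{\lambda^k e^{-\lambda}}{k!} \qquad (k = 0, 1, 2, \ldots)$$

which is Eq. 75.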
PE: By inspecting and analyzing the above justification (and the stated assumptions in particular), 
try to set some broad rules (mainly of practical nature) about when the Poisson distribution can be 
used as a good approximation to the binomial distribution. 

4. Calculate the following probabilities of the Poisson distribution given in the form $P(k, \lambda)$:

(a) P(4, 6.7). (b) P(11, 392.7). (c) P(481, 2.8).
(d) P(663, 7.9). (e) P(7829, 6961.5). (f) P(9947, 5915.3).
Answer: Using Eq. 75 we get:

(a) $P(4, 6.7) \simeq 1.033511 \times 10^{-1}$. (b) $P(11, 392.7) \simeq 2.432603 \times 10^{-150}$.
(c) $P(481, 2.8) \simeq 8.139821 \times 10^{-870}$. (d) $P(663, 7.9) \simeq 1.444462 \times 10^{-993}$.
(e) $P(7829, 6961.5) \simeq 1.254247 \times 10^{-25}$. (f) $P(9947, 5915.3) \simeq 2.273048 \times 10^{-497}$.
PE: Repeat the Problem for the following: 

(a) P(11,9.1). (b) P(37, 222.4). (c) P(3611, 56.8). 
(d) P(5390, 153.7). (e) P(829, 8421.7). (f) P(9381, 7429.4). 


5. What is the effect of varying $\lambda$ on the shape and position of (the curve representing) the Poisson
distribution?
Answer: Noting that for the Poisson distribution we have $\mu = V = \lambda$, increasing $\lambda$ results in
shifting the peak of (the curve representing) the distribution to the right and flattening (the curve
representing) the distribution. In Figure 19 we plotted the Poisson distribution for a number of $\lambda$'s
(i.e. $\lambda = 10, 30, 50, 70, 90$). As we see, as $\lambda$ increases the peak of the curve shifts to the right (because
$\mu = \lambda$) and the curve becomes more flat by decreasing its height and broadening its width (because
$V = \lambda$).[95]

Figure 19: The plot of Problem 5 of § 4.1.4. For clarity, we use solid curves instead of discrete points.
Each number on the top (i.e. $\lambda = 10, 30, 50, 70, 90$) belongs to the profile beneath it.

PE: Compare this behavior to the behavior of the corresponding binomial distribution (see Problems
8 and 9 of § 4.1.2) noting that for the binomial distribution $\mu = np$ and $V = np(1-p)$. Try to link the
two behaviors to the relationship between the two distributions noting that for the Poisson distribution
$\lambda = \mu = V$.

6. Show that the Poisson distribution is normalized. 
Answer: From Eq. 75 we have: 


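$$\sum_k P(k) = \sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1$$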


where we used the exponential series in the third step. 
PE: Explain and justify (in words) all the details of this derivation. 

7. A (large) specimen of a radioactive material (with very long half-life) is recorded to emit 15920 β-ray
particles during one hour. Plot the probability of having $k$ emissions per second for $k = 0, 1, \ldots, 15$
(assuming a Poisson distribution).

Answer: We have $\lambda = \mu = 15920/3600 \simeq 4.42222$ emissions per second. Using Eq. 75, we calculated
$P(k)$ for $k = 0, 1, \ldots, 15$ and plotted the results in Figure 20.
PE: Repeat the Problem for 34197 emissions in 5 hours. 

8. A population of 1592738 individuals were given a vaccine whose probability of causing blood clots is
$p = 0.0000053$. Plot the probability of having $k$ cases of blood clotting (for $k = 0, 1, \ldots, 20$) in this
process of mass vaccination (assuming a Poisson distribution).

Answer: We have $\lambda = \mu = np = 1592738 \times 0.0000053 = 8.4415114$ cases of clotting.[96] Using Eq.
75, we calculated $P(k)$ for $k = 0, 1, \ldots, 20$ and plotted the results in Figure 21.

Figure 21: The plot of Problem 8 of § 4.1.4.

[95] A feature of the Poisson distribution (which cannot be easily noticed in Figure 19) is that as $\lambda$ increases the shape of
the curve becomes more symmetric (about its value at the peak).

PE: Repeat the Problem for 2372042 individuals with probability of clotting p = 0.0000037. 


9. Similar to what we did in Problem 13 of § 4.1.2, derive a recurrence formula for the Poisson distribution.

Answer: From Eq. 75 we have:

$$P(k+1) = \frac{\lambda^{k+1} e^{-\lambda}}{(k+1)!} = \frac{\lambda}{(k+1)} \cdot \frac{\lambda^k e^{-\lambda}}{k!} = \frac{\lambda}{(k+1)}\, P(k)$$

PE: Suggest useful applications and advantages of this recurrence formula. 

10. Give an example showing that the Poisson distribution is generally more convenient in calculation than
the corresponding binomial distribution and hence justify the use of Poisson as an approximation to
the binomial (when the approximation of binomial by Poisson is valid).
Answer: For example, let us have a binomial probability problem where we have to calculate the
probability $P(n, p, k_1 \le k \le k_2) = P(200, 0.05, 6 \le k \le 8)$. Now, the binomial probability is:

$$\begin{aligned}
P(200, 0.05, 6 \le k \le 8) &= C^{200}_{6}\, 0.05^{6}\, 0.95^{194} + C^{200}_{7}\, 0.05^{7}\, 0.95^{193} + C^{200}_{8}\, 0.05^{8}\, 0.95^{192} \\
&= \frac{200!}{6!\,194!}\, 0.05^{6}\, 0.95^{194} + \frac{200!}{7!\,193!}\, 0.05^{7}\, 0.95^{193} + \frac{200!}{8!\,192!}\, 0.05^{8}\, 0.95^{192} \\
&\simeq 0.0614 + 0.0896 + 0.1137 \simeq 0.2647
\end{aligned}$$

On the other hand, the corresponding Poisson probability $P(\lambda, k_1 \le k \le k_2)$ is (noting that $\lambda = np =
10$):

$$\begin{aligned}
P(10, 6 \le k \le 8) &= e^{-10} \left( \frac{10^6}{6!} + \frac{10^7}{7!} + \frac{10^8}{8!} \right) \\
&\simeq 0.0631 + 0.0901 + 0.1126 \simeq 0.2657
\end{aligned}$$

As we see, the Poisson probability requires far fewer calculations than the corresponding binomial
probability and the results are very similar with a minute relative percentage error of about 0.4%
(noting that the conditions for the validity of this approximation are generally satisfied in this example).

[96] As we see, $\lambda = np$ should remind us that the binomial is in the background.
PE: Repeat the Problem by giving another example. 

11. Write a simple program that calculates the probabilities of the Poisson distribution. 
Answer: See the PoissonProbability.cpp code which calculates the individual probabilities of the
Poisson distribution.
Note: to do extensive calculations of Poisson distribution (i.e. on a given range of $k$) conveniently,
we wrote another code (see the PoissonDistribution.cpp code) in which we put the core of the
PoissonProbability.cpp code into a $k$ loop and output the results to a file. This code is especially useful
for doing extensive calculations in extreme cases (such as those cases involving very large or/and very
small numbers).
PE: Comment the PoissonProbability.cpp code explaining what each line is supposed to do. 


4.1.5 Geometric Distribution 


The random variable in this probability distribution represents the number of (binomial or Bernoulli) 
trials required to obtain the first success. Accordingly, this distribution may be regarded as a special 
case of the binomial distribution (although this should be understood as a similarity rather than being so 
literally). It should be obvious that this distribution is given by: 


$$P(x = k) = P(p, k) = (1-p)^{k-1}\, p \qquad (k = 1, 2, 3, \ldots) \tag{80}$$

where $k$ is the number of trials required to obtain the first success and $p$ is the probability of success
($0 < p < 1$).[97] For the geometric distribution the mean $\mu$ and the variance $V$ are given by (see Problem
9 of § 5.1 and Problem 6 of § 5.2):

$$\mu = \frac{1}{p} \qquad\qquad V = \frac{1-p}{p^2} \tag{81}$$

It is important to note the following about the geometric distribution:
• The “geometric” label comes from the fact that the terms of this distribution represent a geometric
sequence where each term is obtained from the previous one by multiplying it by a factor of $(1 - p)$.
• There is another form of the geometric distribution which is used for representing the number of
failures before the first success, and this form requires slight modifications to the above formulations and
conditions. So, the readers should be aware of this to avoid confusion (see Problem 7).
• The geometric distribution is a special case of the negative binomial distribution (which will be
investigated in § 4.1.6) corresponding to $r = 1$ (noting the difference in $k$ between the two distributions).
• The geometric distribution is commonly used to model the waiting time or the lifetime in probabilistic
processes.[98]

[97] The latter condition may be stated as $0 < p \le 1$ but we prefer to exclude the case of $p = 1$.


Problems 


1. Justify Eq. 80.

Answer: The random variable in the geometric distribution represents the number $k$ of binomial
trials required to obtain the first success. This means that we should have $k - 1$ failures (each with
probability $1 - p$) before we get the first success at the $k^{th}$ trial (where the probability of success is $p$).
Hence, by the multiplication law of independent events (see Eq. 57) the probability of obtaining the
first success at the $k^{th}$ trial should be $(1-p)^{k-1} p$, which is what Eq. 80 states.

PE: What is the mass function of a probability distribution in which the random variable represents 
the number of binomial trials required to obtain the first failure? 

2. Give some examples of processes that can be modeled by the geometric distribution.

Answer:

• Tossing a coin until a tail is obtained (where success is defined as “getting tail”).

• Throwing a die until 4 appears (where success is defined as “getting 4”).

PE: Give more examples of processes that can be modeled by the geometric distribution. 

3. Show that the geometric distribution is normalized.

Answer: From Eq. 80 we have:

$$\sum_k P(k) = \sum_{k=1}^{\infty} (1-p)^{k-1} p = p \sum_{k=1}^{\infty} (1-p)^{k-1} = p \times \frac{1}{1-(1-p)} = p \times \frac{1}{p} = 1$$

where we used the well-known geometric series formula [i.e. $S = a_1/(1-r)$] in the third equality.
PE: Why should the series $\sum_{k=1}^{\infty} (1-p)^{k-1} p$ be convergent?

4. What is the effect of varying $p$ on the shape of (the curve representing) the geometric distribution?
Answer: We note that the first value of this distribution is $p$ [i.e. $P(1) = p$] and hence the curve
representing this distribution starts high for high $p$ (and low for low $p$). Now, since the area under
the curve should be unity (due to normalization), we should expect the curve to drop faster for high
$p$ (and slower for low $p$).[99] These features are evident in Figure 22 where we plotted the geometric
distribution for $p = 0.1, 0.3, 0.5, 0.7, 0.9$ between $k = 1$ and $k = 10$.

PE: Plot similar curves for p = 0.2, 0.4, 0.6, 0.8. 

5. In a game of chance, the player throws a die and a coin simultaneously until he gets a given face of
the die (say 6) and a given face of the coin (say head H) simultaneously.[100] Find the probability
distribution of this game and calculate the chance of winning (where win is determined by getting 6
and H before the fourth trial).

Answer: This is an instance of the geometric distribution with:

$$p = P(6 \cap H) = P(6) \times P(H) = \frac{1}{6} \times \frac{1}{2} = \frac{1}{12}$$

where we use the multiplication law of independent events (see Eq. 50) since the events of getting 6
and getting H are independent. Accordingly, the probability distribution for this game is (see Eq. 80):

$$P(k) = (1-p)^{k-1}\, p = \left(1 - \frac{1}{12}\right)^{k-1} \times \frac{1}{12}$$


[98] This may be justified by the property of the geometric distribution (which is shared also by the exponential distribution;
see § 4.2.3) that it is “memoryless”.

[99] For clarity and simplicity we use a language more appropriate for continuous distributions (e.g. using “curve” and “area”).
Alternatively, we may say (using a language more appropriate for discrete distributions): the successive values of this
distribution (corresponding to successive $k$) are obtained by multiplication with $(1 - p) < 1$ and hence larger $p$ means
smaller $(1 - p)$ and hence faster drop while smaller $p$ means larger $(1 - p)$ and hence slower drop.


[100] Simultaneously means getting (6 AND H) at the same trial. 
Figure 22: The plot of Problem 4 of § 4.1.5. For clarity, we use solid curves instead of discrete points.


The chance of winning is represented by the probability of getting 6 and H (simultaneously) in the 1st
or 2nd or 3rd trial, and hence it is obtained by summing these probabilities (of these disjoint events),
that is:

$$P(\text{win}) = \sum_{k=1}^{3} \left(1 - \frac{1}{12}\right)^{k-1} \times \frac{1}{12} = \frac{1}{12} + \frac{11}{144} + \frac{121}{1728} = \frac{397}{1728} \simeq 0.2297$$
PE: Repeat the Problem assuming now that win is determined by getting (simultaneously) an even 
number on the die and H on the coin before the third trial. 

6. In a game of chance, the player throws a die and a coin simultaneously until he gets a given face of
the die (say 6) and a given face of the coin (say head H) simultaneously or separately.[101] Find the
probability distribution of this game and calculate the chance of winning (where win is determined by
getting 6 and H before the fourth trial).

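Answer: The probability of getting 6 and H (simultaneously or separately) at the $k^{th}$ trial can be
obtained from the sum of the following probabilities (noting that the events of these probabilities are
mutually exclusive):

• $P(H_{\text{at } k} \cap 6_{\text{at or before } k})$ which is the probability of getting the first H at the $k^{th}$ trial and getting 6
at or before the $k^{th}$ trial.

• $P(6_{\text{at } k} \cap H_{\text{before } k})$ which is the probability of getting the first 6 at the $k^{th}$ trial and getting H before
the $k^{th}$ trial (noting that $k > 1$ in this case because there is no “before” in the first trial).

Now, $P(H_{\text{at } k} \cap 6_{\text{at or before } k})$ is given by:

$$\begin{aligned}
P(H_{\text{at } k} \cap 6_{\text{at or before } k}) &= P(H_{\text{at } k}) \times P(6_{\text{at or before } k}) && \text{(Eq. 50)} \\
&= \left[\left(1 - \frac{1}{2}\right)^{k-1} \times \frac{1}{2}\right] \times \left[\sum_{i=1}^{k} \left(1 - \frac{1}{6}\right)^{i-1} \times \frac{1}{6}\right] && \text{(Eq. 80)} \\
&= \left(\frac{1}{2}\right)^{k} \times \left[\frac{1}{6} \sum_{i=1}^{k} \left(\frac{5}{6}\right)^{i-1}\right] \\
&= \left(\frac{1}{2}\right)^{k} \times \left[\frac{1}{6} \times \frac{1 - (5/6)^k}{1 - (5/6)}\right] && \text{(geometric series)} \\
&= \left(\frac{1}{2}\right)^{k} \times \left[1 - \left(\frac{5}{6}\right)^{k}\right] = \frac{6^k - 5^k}{12^k} && (k = 1, 2, 3, \ldots)
\end{aligned}$$

[101] Separately means getting the first 6 (or H) at the $k^{th}$ trial and getting H (or 6) once or more before the $k^{th}$ trial.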


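Similarly, $P(6_{\text{at } k} \cap H_{\text{before } k})$ is given by (noting that $k > 1$):

$$\begin{aligned}
P(6_{\text{at } k} \cap H_{\text{before } k}) &= P(6_{\text{at } k}) \times P(H_{\text{before } k}) && \text{(Eq. 50)} \\
&= \left[\left(1 - \frac{1}{6}\right)^{k-1} \times \frac{1}{6}\right] \times \left[\sum_{i=1}^{k-1} \left(1 - \frac{1}{2}\right)^{i-1} \times \frac{1}{2}\right] && \text{(Eq. 80)} \\
&= \left[\left(\frac{5}{6}\right)^{k-1} \times \frac{1}{6}\right] \times \left[\frac{1}{2} \sum_{i=1}^{k-1} \left(\frac{1}{2}\right)^{i-1}\right] \\
&= \left[\left(\frac{5}{6}\right)^{k-1} \times \frac{1}{6}\right] \times \left[\frac{1}{2} \times \frac{1 - (1/2)^{k-1}}{1 - (1/2)}\right] && \text{(geometric series)} \\
&= \left[\left(\frac{5}{6}\right)^{k-1} \times \frac{1}{6}\right] \times \left[1 - \left(\frac{1}{2}\right)^{k-1}\right] = \frac{10^{k-1} - 5^{k-1}}{6 \times 12^{k-1}} && (k = 2, 3, \ldots)
\end{aligned}$$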


Accordingly, the probability of getting 6 and H (simultaneously or separately) at the $k^{th}$ trial is [noting
that $P(6_{\text{at } k} \cap H_{\text{before } k})$ as given by the last equation is equal to 0 for $k = 1$]:

$$P(k) = P(H_{\text{at } k} \cap 6_{\text{at or before } k}) + P(6_{\text{at } k} \cap H_{\text{before } k}) = \frac{6^k - 5^k}{12^k} + \frac{10^{k-1} - 5^{k-1}}{6 \times 12^{k-1}} \qquad (k = 1, 2, \ldots)$$


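Regarding the chance of winning, it is represented by the probability of getting 6 and H (simultaneously
or separately) in the 1st or 2nd or 3rd trial, and hence it is obtained by summing these probabilities
(of these disjoint events), that is:

$$\begin{aligned}
P(\text{win}) &= \sum_{k=1}^{3} \left( \frac{6^k - 5^k}{12^k} + \frac{10^{k-1} - 5^{k-1}}{6 \times 12^{k-1}} \right) \\
&= \frac{1}{12} + \frac{6^2 - 5^2}{12^2} + \frac{10 - 5}{6 \times 12} + \frac{6^3 - 5^3}{12^3} + \frac{10^2 - 5^2}{6 \times 12^2} = \frac{637}{1728} \simeq 0.3686
\end{aligned}$$

Note: this distribution is normalized, that is:

$$\sum_{k=1}^{\infty} \left( \frac{6^k - 5^k}{12^k} + \frac{10^{k-1} - 5^{k-1}}{6 \times 12^{k-1}} \right) = \left(1 - \frac{5}{7}\right) + \frac{1}{6}\left(6 - \frac{12}{7}\right) = \frac{2}{7} + \frac{5}{7} = 1$$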

PE: Repeat the Problem assuming now that win is determined by getting (simultaneously or separately) 
an odd number on the die and T on the coin before the fifth trial. 
7. Outline the other form of the geometric distribution (which was indicated earlier in the notes).
Answer: We have (noting that in this form $k$ represents the number of failures before the first success):

$$P(p, k) = (1-p)^k\, p \qquad\qquad \mu = \frac{1-p}{p} \qquad\qquad V = \frac{1-p}{p^2}$$

where $k = 0, 1, 2, \ldots$ and $0 < p < 1$.
PE: Justify the formulations and conditions given in this Problem for the other form of the geometric
distribution (ignoring $\mu$ and $V$).

4.1.6 Other Discrete Distributions


There are many other discrete distributions some of which are outlined in the following points: 

• Negative binomial distribution: this distribution represents the probability of the number of failures
before the $r^{th}$ success in a series of Bernoulli (or binomial) trials, e.g. the probability in a series of
coin-tossing trials of having 5 H (where H represents failure) before obtaining 3 T (where T represents success).
If $r$ represents the number of successes (each with probability $p$) and $k$ represents the number of failures
(each with probability $1 - p$) then the mass function of this distribution is given by:

$$P(k, r, p) = C^{k+r-1}_{k}\, p^r (1-p)^k \tag{82}$$

where $k = 0, 1, \ldots$ and $r = 1, 2, \ldots$ while $0 < p < 1$. The mean and variance of this distribution are given
by:

$$\mu = \frac{r(1-p)}{p} \qquad\qquad V = \frac{r(1-p)}{p^2} \tag{83}$$

As indicated earlier, the geometric distribution (see § 4.1.5) is a special case of the negative binomial
distribution corresponding to $r = 1$. We also note that the “negative binomial” label comes from the fact
that the coefficient $C^{k+r-1}_{k}$ in the equation of this distribution (see Eq. 82) is equal to the magnitude
of $C^{-r}_{k}$ (i.e. $C^{k+r-1}_{k} = |C^{-r}_{k}|$) which is the binomial coefficient in the binomial theorem expansion for
negative integer powers (see Problem 46 of § 2.2).

• Hypergeometric distribution:[102] this distribution is similar to the binomial distribution but while
the trials in the binomial are independent of each other and hence they have a fixed probability (i.e.
they are Bernoulli trials), the trials in the hypergeometric are not independent (and hence they are not
of Bernoulli type). In brief, the hypergeometric distribution represents the probability of the number of
successes in a series of random trials in which objects are drawn with no replacement from a population
(consisting of two types) and hence the probability of success (represented by one type) and failure
(represented by the other type) in each draw depends on the outcomes of the previous trials. For example,
if we draw consecutively with no replacement a number of balls from a box that contains given numbers
of blue and red balls (where drawing blue is a success and drawing red is a failure) then it is obvious that
the probability of drawing blue and red in each trial depends on the outcomes of the previous trials.[103]
The mass function of this distribution is given by:

$$P(N, n, R, r) = \frac{C^{R}_{r}\, C^{N-R}_{n-r}}{C^{N}_{n}} \tag{84}$$

where $N$ is the size of the population from which the choices are made, $n$ is the number of trials made,
$R$ is the number of available successes within $N$, and $r$ is the number of actual successes made in the
trials.[104] The mean and variance of this distribution are given by:

$$\mu = \frac{nR}{N} \qquad\qquad V = n\, \frac{R}{N} \left(1 - \frac{R}{N}\right) \frac{N-n}{N-1} \tag{85}$$


Problems 


1. Justify Eq. 82.
Answer: We have $k + r$ trials (since we have $k$ failures and $r$ successes) where the last trial is a definite
success. Hence, the number of permutations with repetitions (see Problem 30 of § 2.2) for the failures
and successes (noting that the last trial is a definite success and hence we have $k$ repetitive failures and
$r - 1$ repetitive successes that contribute to the permutations) is $C^{k+r-1}_{k} = C^{k+r-1}_{r-1}$. Moreover, the
probability of each one of these permutations is $p^r (1-p)^k$ because we have $r$ successes and $k$ failures.
Therefore, by the addition law of disjoint events (see Eq. 53) the total probability of $r$ successes (each
with probability $p$) and $k$ failures (each with probability $1 - p$) is $C^{k+r-1}_{k}\, p^r (1-p)^k$ which is what is
given by Eq. 82.

[102] The “hypergeometric” label comes from its moment generating function.

[103] For instance, if the box contains $N$ balls ($R$ blue and $N - R$ red) and we draw two balls then if the first draw is blue
then the probability of drawing blue in the second draw is $(R - 1)/(N - 1)$, while if the first draw is red then the
probability of drawing blue in the second draw is $R/(N - 1)$. It is worth noting that being “consecutive” is for clarity
rather than being a condition (i.e. this distribution applies even if the balls are drawn in one go noting that in this
case the probability of the color of each ball in the selection depends on the color of the “previous” balls in the selection
where “previous” here means “in consideration” rather than “in time”).

[104] We note that all these numbers are finite. Moreover, $0 \le R \le N$ and $0 \le r \le R$ restricted by $r \le n$. The readers
are referred to Problem 6 of the present subsection and to Problem 2 of § 7.4 for examples of the hypergeometric
distribution.

PE: Justify the use of the addition law of disjoint events (as given by Eq. 53) in the above argument. 
2. Show that the negative binomial distribution is normalized. 

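Answer: From Eq. 82 we have (where for brevity and clarity we use $q = 1 - p$):

$$\begin{aligned}
\sum_{k=0}^{\infty} P(k, r, p) &= \sum_{k=0}^{\infty} C^{k+r-1}_{k}\, p^r q^k \\
&= p^r \sum_{k=0}^{\infty} C^{k+r-1}_{k}\, q^k && (r \text{ is fixed}) \\
&= p^r \sum_{k=0}^{\infty} \left[(-1)^k\, C^{k+r-1}_{k}\right] (-q)^k\, 1^{-r-k} \\
&= p^r\, (-q + 1)^{-r} && \text{(Eq. 35)} \\
&= p^r\, \frac{1}{(1-q)^r} \\
&= p^r\, \frac{1}{p^r} && (1 - q = p) \\
&= 1
\end{aligned}$$

PE: In the above derivation we treated $r$ as fixed and $k$ as variable. Justify this.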

3. Justify Eq. 84. 
Answer: Let us justify this equation by using an example where we have a pack of $N$ cards, $R$ of which
are blue and the rest (i.e. $N - R$) are red. Now, if we draw $n$ cards ($n \le N$) then our chance of getting
$r$ blue cards ($r \le R$) should be $C^{R}_{r} C^{N-R}_{n-r} / C^{N}_{n}$. This is because there are $C^{R}_{r}$ ways for drawing $r$ blues
out of the available $R$ blues, and there are $C^{N-R}_{n-r}$ ways for drawing $(n - r)$ reds out of the available
$(N - R)$ reds, and hence by the fundamental principle of counting (see § 2.2) there are $C^{R}_{r} C^{N-R}_{n-r}$ ways
for drawing $r$ blues and $(n - r)$ reds out of the available $N$ cards. Moreover, we have a total of $C^{N}_{n}$
ways for drawing $n$ cards out of the available $N$ cards. So, by the definition of probability (see § 3.2
and Eq. 38 in particular) the probability of getting $n$ cards ($r$ blue and $n - r$ red) out of the available
$N$ cards ($R$ blue and $N - R$ red) should be $C^{R}_{r} C^{N-R}_{n-r} / C^{N}_{n}$ which is what is given by Eq. 84.
PE: Obtain an expression for $P(N, n, R, r)$ in terms of factorials.

4. Show that the hypergeometric distribution is normalized. 
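Answer: From Eq. 84 we have:

$$\sum_{r=0}^{n} P(N, n, R, r) = \sum_{r=0}^{n} \frac{C^{R}_{r}\, C^{N-R}_{n-r}}{C^{N}_{n}} = \frac{1}{C^{N}_{n}} \sum_{r=0}^{n} C^{R}_{r}\, C^{N-R}_{n-r} = \frac{1}{C^{N}_{n}}\, C^{N}_{n} = 1$$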


where we used the Vandermonde identity (see part g of Problem 25 of § 2.2) in the third step. 
PE: Investigate other methods for showing the normalization of the hypergeometric distribution. 

5. Regarding the hypergeometric distribution, it is claimed that if n is negligible in comparison to N, 
R and (N — R) then the hypergeometric distribution P(N,n, R,r) can be replaced by the binomial 
distribution P(n,r,p) where p= R/N. Justify this claim. 

Answer: This claim sounds logical because the probability of success at a given trial is $(R - s)/(N - t)$
(where $s$ and $t$ represent respectively the number of successes and trials made before that trial) and
hence if $n$ is negligible in comparison to $R$ and $N$ then $s$ and $t$ are also negligible in comparison to $R$
and $N$ and thus $(R - s)/(N - t)$ can be approximated by $p = R/N$ which is the (virtually) constant
probability of success. Similarly, the probability of failure at a given trial is $(N - R - f)/(N - t)$
(where $f$ and $t$ represent respectively the number of failures and trials made before that trial) and
hence if $n$ is negligible in comparison to $(N - R)$ and $N$ then $f$ and $t$ are also negligible in comparison
to $(N - R)$ and $N$ and thus $(N - R - f)/(N - t)$ can be approximated by $q = (N - R)/N$ which is the
(virtually) constant probability of failure. Accordingly, the probabilities of both success and failure are
(virtually) constant (where they add up to unity) and hence the trials are (virtually) independent of
each other which means that they can be treated as Bernoulli trials (see the preamble of the present
chapter) and thus their probability can be modeled by the binomial distribution as an approximation
to the hypergeometric distribution.

Note: an obvious consequence of the above result is that when we sample from a very large population 
it does not make a difference (or rather considerable difference) whether we sample with or without 
replacement (as long as our sample is tiny in comparison to the population as given by the above 
conditions). This is because in both cases we can use the binomial distribution (i.e. as an exact 
model in the case of replacement and as an approximation to the hypergeometric in the case of no 
replacement). Also see Problem 4 of § 1.5.4. We also note that the binomial distribution is generally 
easier to evaluate than the hypergeometric distribution because the binomial has only one binomial 
coefficient while the hypergeometric has three. 

PE: Create a table or a plot in which you compare (by an example) the binomial distribution to the 
hypergeometric distribution where the former can approximate the latter (i.e. by satisfying the above 
conditions). 


6. The player in a game of chance draws successively and with no replacement 5 balls at random from a
box containing 4 red balls and 6 blue balls. What is the probability of drawing 0, 1, 2, 3, 4 red balls?[105]
Answer: This is an instance of the hypergeometric distribution and hence we use Eq. 84 (noting that
$N = 10$, $n = 5$, $R = 4$ and $r = 0, 1, 2, 3, 4$), that is:

$$\begin{aligned}
P(10, 5, 4, 0) &= C^{4}_{0} C^{6}_{5} / C^{10}_{5} = (1 \times 6)/252 \simeq 0.02381 \\
P(10, 5, 4, 1) &= C^{4}_{1} C^{6}_{4} / C^{10}_{5} = (4 \times 15)/252 \simeq 0.23810 \\
P(10, 5, 4, 2) &= C^{4}_{2} C^{6}_{3} / C^{10}_{5} = (6 \times 20)/252 \simeq 0.47619 \\
P(10, 5, 4, 3) &= C^{4}_{3} C^{6}_{2} / C^{10}_{5} = (4 \times 15)/252 \simeq 0.23810 \\
P(10, 5, 4, 4) &= C^{4}_{4} C^{6}_{1} / C^{10}_{5} = (1 \times 6)/252 \simeq 0.02381
\end{aligned}$$

As we see, the sum of these probabilities is 1 as it should be.[106]
PE: Repeat the Problem with N = 15,n=7, R=3 and r=0,1,2,3. 


4.2 Probability Density Functions 


As indicated earlier, probability density function is defined for continuous variables as a real function $f(x)$
that gives the probability value as a function of the (continuous) random variable $x$, i.e. $f(x_i)\,dx$ is the
probability that $x$ lies in the interval $x_i < x < x_i + dx$.[107] The two main properties of the probability
density function (which are similar to corresponding properties of the probability mass function; see Eqs.
64 and 67) are:

$$f(x) \ge 0 \qquad (-\infty < x < +\infty) \tag{86}$$

$$\int_{-\infty}^{+\infty} f(x)\,dx = 1 \tag{87}$$

The actual probabilities are obtained by taking the integral of the probability density function over a given
range of the random variable (i.e. the range whose probability is required). For example, the probability
of $x$ being between $-1$ and $+1$ is given by:

$$P(-1 \le x \le +1) = \int_{-1}^{+1} f(x)\,dx \tag{88}$$

In the subsections of this section we investigate some of the well known and commonly used probability
density functions. It is noteworthy that probability density functions have usually an advantage over the
corresponding mass functions by being more manageable analytically and computationally since dealing
with continuous variables (e.g. through differentiation and integration) is generally easier than dealing
with discrete variables (e.g. through differencing and summation).

[105] We note that “successively and with no replacement” is for clarity rather than being a condition (i.e. it could be “at
once” instead).

[106] We note that by convention $C^{n}_{m} = 0$ for $0 \le n < m$ and hence $P(10, 5, 4, 5) = C^{4}_{5} C^{6}_{0} / C^{10}_{5} = 0$ since $C^{4}_{5} = 0$.

[107] “Probability value” should be understood in this sense. More clarifications about this issue will follow.


Problems 


1. Justify the properties of the density function (as expressed by Eqs. 86 and 87). 
Answer: 
• Eq. 86 is justified by the axioms of probability (see axiom 1 of § 3.1) noting that the probability is
given by $f(x)\,dx$ and $dx > 0$.
• Eq. 87 is justified by the axioms of probability (see axiom 2 of § 3.1).
PE: Compare between the properties of density functions (as expressed by Eqs. 86 and 87) and the 
corresponding properties of mass functions. 
2. Make a comparison between mass function and density function. 
Answer: For example: 
• Mass function is discrete, while density function is continuous (and hence the normalization condition,
for instance, is given as a sum in the case of mass function as seen in Eq. 67, and is given as an integral
in the case of density function as seen in Eq. 87).
• Mass function is a “probability” function (and hence it gives probability directly), while density
function is a “probability rate” function (and hence it gives probability indirectly). This can be seen
clearly from the fact that the probability of $x_i$ in the discrete case is given by $P(x_i)$ while the probability
of $x_i$ in the continuous case is given by $f(x_i)\,dx$ and not by $f(x_i)$.[108] In fact, this should (partly)
explain our use of different symbols for mass and density functions, i.e. we use $P$ for mass function
and $f$ for density function.
• The values of mass function cannot exceed one while the values of density function can exceed one,
i.e. it is impossible to have $P(x_i) > 1$ since $P(x_i) \le 1$ but it is possible to have $f(x_i) > 1$.
PE: Justify (in technical terms) the last property given in the answer of this Problem (by linking it 
to the other given properties). 
3. Give some examples of functions that can represent probability density (i.e. they meet the conditions 
of Eqs. 86 and 87). 
Answer: The following functions can represent probability density because they meet the conditions 
of Eqs. 86 and 87. 
(a) f(@)= § (O<x<4), f(z7)=0 (a@<Oanda>4). 


f(z) = 

(b) f(x) = 2) (-00 < & < +00). 

(c) f(z)=3, (-l10<2<10), f(z)=0 («<-10and x > 10). 
(a) f(z) =SS (00 < 2 < $00). 
PE: Give more examples like the given ones. 

4. Which of the following can be a probability density function: 


(a) f(z) = 305) (-1<2<1), f(x)=0 (w@<-lands>1). 


7 


(c) fe) = (-3<2<4), f(x)=0 («<-Fandz> 4). 


[108] The readers are referred to Problem 1 of § 4.3.2 for further clarifications about this issue. 
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(d) f(z) = # (-3<a<4), f(z)=0 (a <-—3anda>4). 

Answer: 

(a) It can, because it satisfies the conditions of Eqs. 86 and 87. 

(b) It cannot, because it does not satisfy the condition of Eq. 87. 

(c) It can, because it satisfies the conditions of Eqs. 86 and 87. 

(d) It cannot, because it does not satisfy the condition of Eq. 86. 

PE: Verify the answer of this Problem by verifying the conditions of Eqs. 86 and 87 in each case. 
5. Find the constant c in the following probability density functions: 


(a) f(x) = c e^{−2x} (0 ≤ x < ∞). (b) f(x) = 0.125x + c (1 < x < 8).
(c) f(x) = c x² (−1 < x < +1). (d) f(x) = c ln(x) (1 < x < 7).
Answer: Any probability density function must satisfy Eq. 87 (as well as Eq. 86) and this can be 


used (i.e. by integrating the function over its entire range and equating the result to 1) to infer the 
value of c in each case, as follows: 


(a) $\int_0^{\infty} c\,e^{-2x}\,dx = 1 \;\rightarrow\; \frac{c}{2} = 1 \;\rightarrow\; c = 2$.
(b) fo (0.125¢+c)de=1 + ‘48H 1 + c=i 
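(c) $\int_{-1}^{+1} c\,x^2\,dx = 1 \;\rightarrow\; \frac{2c}{3} = 1 \;\rightarrow\; c = \frac{3}{2}$.
(d) $\int_1^7 c\ln(x)\,dx = 1 \;\rightarrow\; [7\ln(7) - 6]\,c = 1 \;\rightarrow\; c = \frac{1}{7\ln(7) - 6}$.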
PE: Find the constant c in the following probability density functions: 
(a) f(z) =ce3>  (0<a2< oo). (b) f(z) =05¢a+e (1<a< 3). 
(c) f(z) =ca* (-2<2< +42). (d) f(z) =cln(2x) (1<a<5). 
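
The method used in the answer of Problem 5 (i.e. integrating the function over its entire range and equating the result to 1) can also be carried out numerically. The following C++ code is only a minimal sketch of this idea and is not one of the codes that accompany this book. It assumes that the density has the form f(x) = c g(x) on a finite interval, and the names simpson and normalization_constant are hypothetical. Accordingly, it applies to parts (c) and (d) but not to part (a), whose range is infinite, or part (b), where c is an additive constant.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Composite Simpson rule for the integral of g over [lo, hi].
// The number of sub-intervals must be even.
double simpson(const std::function<double(double)>& g, double lo, double hi,
               int intervals = 2000) {
    assert(intervals > 0 && intervals % 2 == 0 && lo < hi);
    const double h = (hi - lo) / intervals;
    double sum = g(lo) + g(hi);
    for (int i = 1; i < intervals; ++i) {
        // Interior points alternate between the weights 4 and 2.
        sum += (i % 2 == 1 ? 4.0 : 2.0) * g(lo + i * h);
    }
    return sum * h / 3.0;
}

// Constant c that normalizes f(x) = c g(x) on [lo, hi], obtained from
// the condition that the integral of f over its range equals 1 (Eq. 87).
double normalization_constant(const std::function<double(double)>& g,
                              double lo, double hi) {
    return 1.0 / simpson(g, lo, hi);
}
```

For instance, calling normalization_constant with g(x) = x² on [−1, 1] should reproduce c = 3/2 of part (c), and with g(x) = ln(x) on [1, 7] it should reproduce c = 1/[7 ln(7) − 6] of part (d).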


4.2.1 Uniform Distribution 


The continuous uniform distribution (which is the simplest density function) is given by: 


$$f(x) = \begin{cases} \dfrac{1}{b-a} & (a \le x \le b) \\ 0 & (\text{otherwise}) \end{cases} \qquad (89)$$


where a and b are given real constants. For example, if a particle is moving uniformly on a unit circle then
the probability of its position on the perimeter (i.e. the probability of finding the particle at a given position
in the interval between 0 and 2π at a randomly selected instant) is subject to this distribution. For the continuous uniform
distribution the mean μ and the variance V are given by (see Problem 10 of § 5.1 and Problem 7 of § 5.2):


$$\mu = \frac{a+b}{2} \qquad\qquad V = \frac{(b-a)^2}{12} \qquad (90)$$


Problems 


1. Give some examples of the continuous uniform distribution. 
Answer: 
• Having a number between 0 and 2π (representing angle in radian) when spinning a (fair) roulette
wheel.
• Getting a number between 4π and 10π (representing perimeter of circle) when randomly plotting a
circle of radius 2 < r < 5.
Note: the continuous uniform distribution is rather artificial and hence it is difficult to find real life
examples of this distribution that are really and exactly uniform. However, this generally applies to
most, if not all, distributions since they are essentially idealizations of real life distributions which
involve many intricate factors and complexities. Also see Problem 3.
PE: Give more examples for the continuous uniform distribution. 

2. Show that the continuous uniform distribution is normalized. 
Answer: From Eq. 89 we have: 


$$\int_{-\infty}^{+\infty} f(x)\,dx = \int_a^b \frac{1}{b-a}\,dx = \left[\frac{x}{b-a}\right]_a^b = \frac{b-a}{b-a} = 1$$


PE: Explain and justify all the details of this derivation. 

3. Let us assume that we have a random number generator that can generate real numbers in the interval
(0, 1].[109] What is the probability that the generated numbers are between 0.6 and 0.95?
Answer: Since these numbers are randomly generated, we can assume that they are equally likely 
and hence the generated numbers are uniformly distributed. Accordingly, the probability that the 
generated numbers are between 0.6 and 0.95 should be 0.95 — 0.6 = 0.35. 
PE: Create a similar Problem in which coupled (real) random number generators are supposedly used 
to select points in a rectangle. 
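
The answer of Problem 3 can be checked by a simple simulation. The following C++ code is a minimal sketch under stated assumptions: it uses the standard library generator std::mt19937_64 with std::uniform_real_distribution, which produces numbers in [0, 1) rather than (0, 1] (the difference has no effect on the probability of an interval), and the function name fraction_in_range and the fixed seed are hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <random>

// Estimate P(lo < x < hi) for x uniformly distributed on [0, 1) by drawing
// `samples` pseudo-random numbers and counting those inside the interval.
// The seed is fixed so that repeated runs give the same estimate.
double fraction_in_range(double lo, double hi, long samples,
                         unsigned seed = 12345u) {
    assert(samples > 0 && lo <= hi);
    std::mt19937_64 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    long hits = 0;
    for (long i = 0; i < samples; ++i) {
        const double x = uniform(gen);
        if (x > lo && x < hi) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(samples);
}
```

With a large number of samples, fraction_in_range(0.6, 0.95, samples) should be close to the theoretical value 0.95 − 0.6 = 0.35, with a statistical scatter that shrinks as the number of samples grows.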


4.2.2 Normal Distribution 


The normal or Gaussian distribution (which is a function of x and is parameterized by μ and V) is given
by:

$$f(x) = f(x, \mu, V) = \frac{1}{\sqrt{2\pi V}}\, e^{-\frac{(x-\mu)^2}{2V}} \qquad (-\infty < x < \infty) \qquad (91)$$


This distribution is possibly the most important probability distribution (at least for the continuous
probability distributions). This is not only because of its beneficial characteristics and its validity as
an approximate model for many physical and non-physical phenomena, but also because of its ability to
represent and replace other distributions (e.g. binomial and Poisson) approximately in many cases and
circumstances since it represents (under certain conditions) the continuous limit of these distributions.
By definition, the mean of the normal distribution is μ and the variance is V (see Problem 10 of § 5.1 and
Problem 7 of § 5.2).
It is important to note the following about the normal distribution: 
• The standard normal distribution is a normal distribution with μ = 0 and V = 1, i.e.

$$f_s(x) = f(x, \mu = 0, V = 1) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \qquad (92)$$

where the subscript s refers to “standard”.[110] This distribution was exceptionally important in the old
days when calculations were made with the help of tables in which the values of this standard distribution
are listed and used (with translation and scaling) to calculate normal distributions of various shapes and
forms. However, with the wide availability of computing equipment and increasing reliance on them these
days the importance of this standard distribution is reduced.

• The normal distribution can be used (and is used) to approximate other distributions under certain
conditions. In fact, according to the central limit theorem (refer to § 6.2.3) all well-behaved probability
distributions converge to the normal distribution under certain conditions. For example, the normal
distribution can be used (and is widely used) as an approximation to the corresponding binomial
distribution, i.e. with the mean and variance of the normal distribution being given by those of the binomial
distribution, that is: μ = np and V = np(1 − p).[111] In fact, this approximation is reliable in most
cases. Moreover, the approximation becomes more reliable when n is large or/and p is close to 1/2, i.e. it
generally improves with increasing n and with p approaching 1/2 (see Problems 7 and 8).


[109] In fact, random number generators do not really generate real numbers.

[110] We note that the standard normal distribution is usually given as a function of z = (x − μ)/√V where μ and V are the
mean and variance of the original normal distribution which is standardized through this transformation and x is the
random variable of the original distribution.

[111] We note that the normal distribution is a limiting case for the binomial distribution when n → ∞ and p stays finite (so
that np → ∞).


Problems 


1. Give some examples of the normal distribution. 
Answer: It is difficult (if not impossible) to find a real life example that is normally distributed 
exactly. However, many real life random variables are normally distributed approximately, and the 
approximation in many cases is very good (noting as well that the normal distribution can be used as 
an approximation to other distributions in certain circumstances, as indicated above). Examples of 
such variables are the weight of animals of certain species living in a given area, the height of people 
in a city, the weight of newborn babies (of a given ethnicity and location), and the lifetime of a certain 
type of electronic components and devices (operated under certain physical conditions). Also see § 
6.2.2 and § 6.2.3. 
PE: Investigate applications of normal distribution in physical and social sciences and find more real 
life examples of this distribution. 

2. Calculate the following probability densities of the normal distribution given in the form f(x, μ, V):


(a) f(−33.6, −81.9, 3.1). (b) f(167.8, 161.8, 11.6). (c) f(0.0, 0.0, 0.09).
(d) f(−43.8, 51.6, 0.1). (e) f(25.9, −31.8, 43.2). (f) f(199.2, −3.8, 13.7).
Answer: From Eq. 91 we have:
(a) $f(-33.6, -81.9, 3.1) = \frac{1}{\sqrt{6.2\pi}}\, e^{-\frac{48.3^2}{6.2}} \simeq 8.752331 \times 10^{-165}$
(b) $f(167.8, 161.8, 11.6) = \frac{1}{\sqrt{23.2\pi}}\, e^{-\frac{6^2}{23.2}} \simeq 2.481852 \times 10^{-2}$
(c) $f(0.0, 0.0, 0.09) = \frac{1}{\sqrt{0.18\pi}}\, e^{0} \simeq 1.32980760134$
(d) $f(-43.8, 51.6, 0.1) = \frac{1}{\sqrt{0.2\pi}}\, e^{-\frac{95.4^2}{0.2}} \simeq 1.524318 \times 10^{-19763}$
(e) $f(25.9, -31.8, 43.2) = \frac{1}{\sqrt{86.4\pi}}\, e^{-\frac{57.7^2}{86.4}} \simeq 1.117645 \times 10^{-18}$
(f) $f(199.2, -3.8, 13.7) = \frac{1}{\sqrt{27.4\pi}}\, e^{-\frac{203^2}{27.4}} \simeq 7.297248 \times 10^{-655}$
PE: Repeat the Problem for the following: 
(a) f (52.5, 49.4, 44.7). (b) f(—152.5, —177.5, 55.8). (c) f(559.6, 638.9, 215.2). 
(d) f(—69.2, —69.3, 0.01). (e) f (15.9, 78.2, 1.6). (f) f (69.3, 0.0, 2.8). 
3. Outline the main properties of the normal distribution as represented by Eq. 91. 


Answer: 

• It peaks at x = μ.
• It is symmetric with respect to the vertical line x = μ.
• It is normalized to unity, i.e. its integral between −∞ and +∞ equals 1.
• Its graph is bell-shaped.
• Its mean is μ and its variance is V (see Eq. 91).
• It is used as an approximation to other distributions (such as binomial and Poisson) in some limiting
cases.

PE: Discuss the significance of the above properties. Also add more properties if you can. 
4. Show that the normal distribution is normalized. 

Answer: From Eq. 91 we have: 


$$\int_{-\infty}^{+\infty} f(x)\,dx = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi V}}\, e^{-\frac{(x-\mu)^2}{2V}}\,dx = \frac{1}{\sqrt{2\pi V}} \int_{-\infty}^{+\infty} e^{-\frac{(x-\mu)^2}{2V}}\,dx = \frac{\sqrt{2\pi V}}{\sqrt{2\pi V}} = 1$$

where we used standard integration techniques in the third step.[112]


Figure 23: See Problem 5 of § 4.2.2. The numbers on the top (i.e. μ = −7 and μ = 7) belong to the
profiles beneath them.


PE: Explain and justify all the details of this derivation. 

5. What is the effect of varying μ and V on the position and shape of (the curve representing) the normal
distribution?
Answer: Noting that for the normal distribution μ represents the mean and V represents the variance,
increasing μ results in shifting the peak of (the curve representing) the distribution to the right, while
increasing V results in flattening (the curve representing) the distribution.[113] See Figure 23 where
we plotted the normal distribution for μ = −7 and μ = 7 each with V = 1, 4, 9.
PE: Plot the normal distribution for a number of μ’s and V’s (similar to Figure 23) and hence confirm
the observations that we made about the effect of varying μ and V on the position and shape of the
distribution.

6. Show that in the normal distribution: 
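(a) About 68% of the values are within √V from the mean.[114]
(b) About 95% of the values are within 2√V from the mean.
(c) About 99.7% of the values are within 3√V from the mean.
Answer: All these results can be obtained simply by integration using the standard methods of
calculus, that is:

$$\int_{\mu-\sqrt{V}}^{\mu+\sqrt{V}} \frac{1}{\sqrt{2\pi V}}\, e^{-\frac{(x-\mu)^2}{2V}}\,dx \simeq 0.6827$$

$$\int_{\mu-2\sqrt{V}}^{\mu+2\sqrt{V}} \frac{1}{\sqrt{2\pi V}}\, e^{-\frac{(x-\mu)^2}{2V}}\,dx \simeq 0.9545$$

$$\int_{\mu-3\sqrt{V}}^{\mu+3\sqrt{V}} \frac{1}{\sqrt{2\pi V}}\, e^{-\frac{(x-\mu)^2}{2V}}\,dx \simeq 0.9973$$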


[112] In fact, we used the error function.

[113] “Flattening” means increasing the width of the profile and hence lowering the height of the peak (due to normalization).

[114] “The values are within” means the probability of getting the random variable within the given interval (which is
determined by the limits of the integrals in the upcoming answer).

Figure 24: The plot of the normal distribution corresponding to the binomial distribution with n = 100
and p = 1/2. See Problem 7 of § 4.2.2.


PE: Show more details about these integrations (using calculus books or integral calculators if neces- 
sary). 
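
The three probabilities of Problem 6 can be evaluated with the error function. After the substitution z = (x − μ)/√V the integral over the interval within k√V from the mean no longer depends on μ or V and equals erf(k/√2). The following C++ code is a minimal sketch of this evaluation based on the standard library function std::erf, where the function name prob_within is a hypothetical choice made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Probability that a normally distributed random variable lies within
// k * sqrt(V) of its mean. After the substitution z = (x - mu) / sqrt(V)
// the integral becomes independent of mu and V and equals erf(k / sqrt(2)).
double prob_within(double k) {
    assert(k >= 0.0);
    return std::erf(k / std::sqrt(2.0));
}
```

Calling prob_within with k = 1, 2, 3 gives the probabilities of parts (a), (b) and (c) respectively.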

7. Compare the normal distribution to the binomial distribution for the case of normal distribution 
corresponding to the binomial distribution with n = 100 and p = 1/2. 
Answer: For the binomial distribution with n = 100 and p = 1/2 we have μ = 50 and V = 25 (see
Eq. 72), and hence the normal distribution that corresponds to this binomial distribution is (see Eq.
91):

$$f(x) = \frac{1}{\sqrt{50\pi}}\, e^{-\frac{(x-50)^2}{50}}$$

We plotted this f(x) alongside the corresponding binomial points on the same graph (see Figure 24).
As we see, the two are almost identical (noting their different nature as continuous and discrete). Also
see Problem 8.
PE: Explain why for the normal distribution to be more reliable as an approximation to the corre- 
sponding binomial distribution we should have n relatively large or/and p close to 1/2. Try to create 
plots like Figure 24 (using for example spreadsheets) in which you vary n and p so that you can see 
graphically the effect of varying these parameters on the quality of approximation. 
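
The comparison of Problem 7 can also be made numerically rather than graphically. The following C++ code is a minimal sketch (it is not one of the codes that accompany this book) which evaluates the binomial probabilities through logarithms using std::lgamma, evaluates the corresponding normal density of Eq. 91 with μ = np and V = np(1 − p), and reports the largest absolute difference between the two over k = 0, ..., n. The function names are hypothetical and the code assumes 0 < p < 1.

```cpp
#include <cassert>
#include <cmath>

// Binomial probability P(k) for n trials with success probability p,
// evaluated through logarithms to avoid overflow of the factorials.
double binomial_pmf(int n, double p, int k) {
    assert(n >= 0 && k >= 0 && k <= n && p > 0.0 && p < 1.0);
    const double log_coeff = std::lgamma(n + 1.0) - std::lgamma(k + 1.0)
                             - std::lgamma(n - k + 1.0);
    return std::exp(log_coeff + k * std::log(p) + (n - k) * std::log(1.0 - p));
}

// Normal density of Eq. 91 with mean mu and variance V.
double normal_density(double x, double mu, double V) {
    assert(V > 0.0);
    const double pi = std::acos(-1.0);
    return std::exp(-(x - mu) * (x - mu) / (2.0 * V)) / std::sqrt(2.0 * pi * V);
}

// Largest absolute difference between the binomial probabilities and the
// corresponding normal density (mu = n p, V = n p (1 - p)) over k = 0..n.
double max_abs_difference(int n, double p) {
    const double mu = n * p;
    const double V = n * p * (1.0 - p);
    double worst = 0.0;
    for (int k = 0; k <= n; ++k) {
        const double d = std::fabs(binomial_pmf(n, p, k) - normal_density(k, mu, V));
        if (d > worst) worst = d;
    }
    return worst;
}
```

For n = 100 and p = 1/2 the binomial probability at k = 50 is about 0.0796 while the normal density at x = 50 is about 0.0798, which is in line with the close agreement seen in Figure 24. Varying n and p in max_abs_difference is one way of approaching the PE of this Problem.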

8. Make plots of the binomial distribution for n = 140 with p = 0.05, 0.20, 0.35, 0.50, 0.70, 0.90 and their 
corresponding normal distribution. 
Answer: See Figure 25. 
PE: Do the following: 
• Comment on the quality of agreement between the binomial distribution and the corresponding
normal distribution in all cases.
• Comment on the effect of varying p on the quality of agreement.
• Investigate the effect of varying n on the quality of agreement.
• Justify the shift of the peak to the right as p increases.
• Explain the observed change of the profile (i.e. the variation of the width of the profile and the
height of its peak) as a result of varying p (where the width/height increases/decreases first then it
decreases/increases afterward).
• Should we expect the agreement between the binomial distribution and the corresponding normal
distribution to deteriorate as V → 0 (see Problem 2 of § 4.2)?


Figure 25: Comparison between the binomial distribution (filled circles) and the corresponding normal
distribution (solid curve) for n = 140 with p = 0.05 (top left), p = 0.20 (top right), p = 0.35 (middle left),
p = 0.50 (middle right), p = 0.70 (bottom left) and p = 0.90 (bottom right). For the corresponding normal
distribution we have μ = np and V = np(1 − p) in each case. The horizontal axis in each frame represents
x (corresponding to k of the binomial distribution) and the vertical axis represents the probability density
f(x) [corresponding to the probability P(k) of the binomial distribution]. See Problem 8 of § 4.2.2.
9. Write a simple program that calculates the probability densities of the normal distribution.

Answer: See the NormalDensity.cpp code which calculates the individual values of probability density
of the normal distribution.
Note: to do extensive calculations of normal distribution (i.e. on discrete points over a certain
range of x corresponding for instance to the range of k of a corresponding discrete distribution)
conveniently, we wrote another code (see the NormalDistribution.cpp code) in which we put the core
of the NormalDensity.cpp code into a k loop and output the results to a file. This code is especially
useful for doing extensive calculations in extreme cases (such as those cases involving very large or/and
very small numbers).
PE: Why may the NormalDensity.cpp code (as well as the NormalDistribution.cpp code) be needed
to calculate the probability densities of the normal distribution despite the fact that the probability
densities of this distribution are easy to calculate by a handheld calculator or a spreadsheet?
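
To illustrate the sort of calculation involved, the following C++ code is a minimal sketch of a normal density calculator. It is not the NormalDensity.cpp code itself (which is not reproduced here) and all its names are hypothetical. The sketch works with the logarithm of the density of Eq. 91 and returns the result as a mantissa and a power of ten, because values such as those of parts (d) and (f) of Problem 2 are far smaller than the smallest positive number representable in double precision and would simply be reported as zero by a direct evaluation.

```cpp
#include <cassert>
#include <cmath>

// A value written as mantissa x 10^exponent, with the mantissa intended to
// lie in [1, 10).
struct Scientific {
    double mantissa;
    long exponent;
};

// Base-10 logarithm of the normal density of Eq. 91.
double log10_normal_density(double x, double mu, double V) {
    assert(V > 0.0);
    const double pi = std::acos(-1.0);
    const double ln_f = -(x - mu) * (x - mu) / (2.0 * V)
                        - 0.5 * std::log(2.0 * pi * V);
    return ln_f / std::log(10.0);
}

// Normal density split into a mantissa and a power of ten, so that
// extremely small densities can still be reported.
Scientific normal_density_scientific(double x, double mu, double V) {
    const double lg = log10_normal_density(x, mu, V);
    const double e = std::floor(lg);
    return Scientific{std::pow(10.0, lg - e), static_cast<long>(e)};
}
```

For example, normal_density_scientific(−43.8, 51.6, 0.1) should return a mantissa of about 1.524318 with exponent −19763, in agreement with part (d) of Problem 2.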

10. Compare the binomial distribution (with n = 1000 and p = 0.1,0.5,0.9) to the corresponding Poisson 
and normal distributions by plotting them on the same figure. 
Answer: We made this comparison in Figure 26.[115] As we see, for this case (in which n is large)
the Poisson distribution is fairly close to the binomial distribution at low p (i.e. p = 0.1) but it differs
considerably when p increases to 0.5 and the difference is exacerbated by increasing p to 0.9.[116] On
the other hand, the normal distribution is almost identical to the binomial distribution in all cases (i.e.
the cases of p = 0.1, 0.5, 0.9) although the agreement between the two distributions is best in the case
of p = 0.5.
PE: Repeat the Problem for n = 2000 and p = 0.05, 0.5, 0.95. 


4.2.3 Exponential Distribution 


The exponential distribution (which is a function of x and is parameterized by a) is given by:

$$f(x) = f(x, a) = a\,e^{-ax} \qquad (0 < x < \infty,\ a > 0) \qquad (93)$$

The mean μ and the variance V of the exponential distribution are given by (see Problem 10 of § 5.1 and
Problem 7 of § 5.2):

$$\mu = \frac{1}{a} \qquad\qquad V = \frac{1}{a^2} \qquad (94)$$
It is important to note the following about the exponential distribution: 

• The exponential distribution is usually used to model the distribution of waiting time between
consecutive events of Poisson type (or lifetime of processes and events of this type).
• The exponential distribution is seen as the continuous limit (or counterpart) of the discrete geometric
distribution (see § 4.1.5) and hence they are distinguished by certain properties (such as being memoryless
or being models for waiting times).
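
The mean and variance of Eq. 94 can be checked by simulating waiting times. The following C++ code is a minimal sketch under stated assumptions: it relies on the standard library class std::exponential_distribution (whose rate parameter plays the role of a in Eq. 93), and the names Moments and sample_exponential as well as the fixed seed are hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <random>

// Sample mean and sample variance of a set of simulated values.
struct Moments {
    double mean;
    double variance;
};

// Draw `samples` waiting times from the exponential distribution with
// parameter a and return their sample mean and variance.
// The seed is fixed so that repeated runs give the same result.
Moments sample_exponential(double a, long samples, unsigned seed = 2024u) {
    assert(a > 0.0 && samples > 1);
    std::mt19937_64 gen(seed);
    std::exponential_distribution<double> waiting_time(a);
    double sum = 0.0;
    double sum_sq = 0.0;
    for (long i = 0; i < samples; ++i) {
        const double x = waiting_time(gen);
        sum += x;
        sum_sq += x * x;
    }
    const double mean = sum / samples;
    return Moments{mean, sum_sq / samples - mean * mean};
}
```

For a = 2 and a large number of samples, the returned mean and variance should be close to 1/a = 0.5 and 1/a² = 0.25 respectively.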


Problems 


1. Give some examples of the exponential distribution. 
Answer: 
• The distribution of waiting time between consecutive tornadoes (where tornadoes usually happen).
• The distribution of waiting time between consecutive earthquakes (where earthquakes usually
happen).
PE: Give more examples of the exponential distribution (e.g. from atomic transitions and
radioactivity).


[115] We note that the calculations are performed using our codes BinomialDistribution.cpp, PoissonDistribution.cpp and
NormalDistribution.cpp.

[116] We should remember (see § 4.1.4) that for the Poisson distribution to be a limiting case (and hence a good approximation)
to the binomial distribution we should have p → 0 (as well as n → ∞).

Figure 26: Comparison between the binomial distribution (with n = 1000 and p = 0.1, 0.5, 0.9) to the
corresponding Poisson and normal distributions. For the corresponding Poisson distribution we use λ = np
and for the corresponding normal distribution we use μ = np and V = np(1 − p). The horizontal axis in
each frame represents k (for binomial and Poisson which corresponds to x for normal) and the vertical
axis represents the probability P(k) [or the density f(x)]. For clarity (as well as other practical reasons),
we use continuous curves to represent discrete distributions (and vice versa). See Problem 10 of § 4.2.2.




2. Show that the exponential distribution is normalized. 
Answer: From Eq. 93 we have: 


$$\int_{-\infty}^{+\infty} f(x)\,dx = \int_0^{\infty} a\,e^{-ax}\,dx = a\left[-\frac{e^{-ax}}{a}\right]_0^{\infty} = a\left[0 + \frac{1}{a}\right] = 1$$


PE: Explain and justify all the details of this derivation. 

3. Justify Eq. 93 (assuming the exponential distribution to be a model for the distribution of time periods 
between consecutive Poisson-like events). 
Answer: If the average number of events happening in a unit interval is a then in an interval of size
x we should have (on average) ax events. Now, if we assume Poisson-like events then we should have
(in the corresponding Poisson distribution) λ = ax and k = 0 (since in the interval x no event happens
because x is supposedly an interval between events). Hence, from Eq. 75 we get:

$$P(0) = \frac{(ax)^0\, e^{-ax}}{0!} = e^{-ax}$$

Now, to get f(x) for the exponential distribution from the P(0) of the corresponding Poisson
distribution we argue that the probability of an event occurring in an infinitesimal time interval d(ax)
(corresponding to ax in e^{−ax}) is:[117]

$$dF = P(0) \times d(ax) = P(0) \times a\,dx = a\,e^{-ax}\,dx$$

Thus:[118]

$$f(x) = \frac{dF}{dx} = a\,e^{-ax}$$
PE: Can the (continuous) exponential distribution be considered a limiting case to the (discrete) 
Poisson distribution (as the continuous normal distribution is considered a limiting case to the discrete 
Poisson distribution for instance)? Justify your answer. 

4. What is the effect of varying a on the shape of the exponential distribution? 
Answer: Noting that e^0 = 1 and a > 0, we have f(x = 0) = a which means that a is the y-intercept
on the (positive) y axis and hence increasing/decreasing a results in raising/lowering the start point of
the curve representing this distribution (although this point is not included in the distribution). Also,
a is the rate of drop of the exponential function (i.e. how fast it decreases noting that x > 0) and
hence increasing/decreasing a results in increasing/decreasing this rate of drop. Both these facts are
in line with the normalization of the distribution (since the area under the curve should be constant,
i.e. equal to 1). In Figure 27 we plotted this distribution for a number of a’s where these features can
be seen clearly.
PE: It is claimed (as indicated earlier) that the exponential distribution is the continuous counterpart 
of the discrete geometric distribution. Justify this claim by comparing Eq. 80 to Eq. 93 and comparing 
Figure 22 to Figure 27. Also explain why f(x) of the exponential distribution can exceed 1 while P(k) 
of the geometric distribution cannot. 


4.2.4 Other Continuous Distributions 


There are many other continuous distributions some of which are outlined in the following points: 

• Gamma distribution: this is a generalization of the exponential distribution which we investigated
earlier (see § 4.2.3), and hence it represents the distribution of time intervals (i.e. waiting time) before
the r-th event in a Poisson-type series of events.[119] The gamma distribution (which is a function of x
and is parameterized by a and r) is given by:[120]

$$f(x) = f(x, a, r) = \frac{a^r x^{r-1} e^{-ax}}{(r-1)!} \qquad (a > 0,\ r \in \mathbb{N},\ 0 < x < \infty) \qquad (95)$$

The mean μ and the variance V of the gamma distribution are given by:

$$\mu = \frac{r}{a} \qquad\qquad V = \frac{r}{a^2} \qquad (96)$$

• Cauchy distribution: this distribution (which may also be called Lorentz distribution or
Cauchy-Lorentz distribution) is given by:[121]

$$f(x) = \frac{1}{\pi\,(1 + x^2)} \qquad (-\infty < x < \infty) \qquad (97)$$

The mean and variance of this distribution are not defined and hence it is described as pathological. In
fact, this distribution is used in the literature as a typical example of pathological distributions. However,
in our view it is sensible for the Cauchy distribution (and similar distributions) to have a mean despite the fact that
the integral defining its mean (i.e. Eq. 116) is divergent. This is because the mean can be inferred from
the symmetry of the function representing this distribution with respect to x = 0 and hence its mean is 0
by the rule of symmetry which will be investigated later (see the notes of § 5.1 as well as Problem 12 of §
5.1). Moreover, we can use the Cauchy principal value to represent the value of the integral defining its
mean. So we can say: despite the fact that the integral defining the mean of the Cauchy distribution (and
similar distributions) is not defined (and hence the mean of this distribution is not defined formally), this distribution
does have a “non-formal” mean which is sensible and useful to accept and use in many theoretical and
practical situations.


Figure 27: The plot of Problem 4 of § 4.2.3. The circles indicate that x = 0 is not included.


[117] We use F here to indicate that this is actually a cumulative probability function (which will be investigated later; see
§ 4.3).

[118] The reader is referred to Eq. 101 which will be investigated later.

[119] We note that in the exponential distribution the waiting time is before the occurrence of the 1st event. Hence, the
gamma distribution (as represented by Eq. 95) reduces to the exponential distribution (as represented by Eq. 93) for
r = 1. Similarly, Eq. 96 reduces to Eq. 94 for r = 1.

[120] If we note that for all positive integers r we have Γ(r) = (r − 1)! then this equation can be written as:

$$f(x) = \frac{a^r x^{r-1} e^{-ax}}{\Gamma(r)}$$

where Γ is the gamma function (and that is why it is called “gamma distribution” of order r). In fact, the gamma
distribution can be extended to all positive (real) numbers r (but we do not follow this issue in this book).

[121] In fact, this form is a special case of the more general form of the Cauchy distribution and hence it may be called
the standard Cauchy distribution. We also note that the Cauchy distribution is a special case of the Breit-Wigner
distribution (which is commonly used in quantum physics).


Problems 


1. Give some examples of the gamma distribution. 
Answer: Just generalize the examples of Problem 1 of § 4.2.3 (i.e. the waiting time before the
occurrence of the r-th tornado or earthquake).
PE: Why is it “the waiting time before the occurrence of the r-th” rather than the (r − 1)-th?

2. Justify Eq. 95. 
Answer: If we follow the method of Problem 3 of § 4.2.3 (noting that the gamma distribution is a
generalization of the exponential distribution with k = r − 1 replacing k = 0) then we get:

$$P(r-1) = \frac{(ax)^{r-1}\, e^{-ax}}{(r-1)!}$$

$$dF = P(r-1) \times d(ax) = P(r-1) \times a\,dx = a\,\frac{(ax)^{r-1}\, e^{-ax}}{(r-1)!}\,dx = \frac{a^r x^{r-1} e^{-ax}}{(r-1)!}\,dx$$

$$f(x) = \frac{dF}{dx} = \frac{a^r x^{r-1} e^{-ax}}{(r-1)!}$$


PE: Reproduce the above argument independently, i.e. with no consideration of the argument of 
Problem 3 of § 4.2.3. 

3. Show that the gamma distribution is normalized. 
Answer: We have: 


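$$\begin{aligned}
\int_{-\infty}^{+\infty} f(x)\,dx &= \int_0^{\infty} \frac{a^r x^{r-1} e^{-ax}}{(r-1)!}\,dx && \text{(Eq. 95)} \\
&= \frac{a^r}{(r-1)!} \int_0^{\infty} x^{r-1} e^{-ax}\,dx && \text{($a$ and $r$ are constants)} \\
&= \frac{a^r}{(r-1)!} \int_0^{\infty} \left(\frac{y}{a}\right)^{r-1} e^{-y}\,d\!\left(\frac{y}{a}\right) && \text{(substitution of $y = ax$)} \\
&= \frac{1}{(r-1)!} \int_0^{\infty} y^{r-1} e^{-y}\,dy \\
&= \frac{1}{(r-1)!}\,\Gamma(r) && \text{(gamma function identity)} \\
&= \frac{(r-1)!}{(r-1)!} = 1 && \text{[$\Gamma(r) = (r-1)!$ since $r \in \mathbb{N}$]}
\end{aligned}$$


PE: Why did we treat a and r as constants?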
4. Show that the Cauchy distribution is normalized. 
Answer: We have (see Eq. 97): 


$$\int_{-\infty}^{+\infty} \frac{1}{\pi\,(1 + x^2)}\,dx = \frac{1}{\pi}\Big[\arctan(x)\Big]_{-\infty}^{+\infty} = \frac{(\pi/2) - (-\pi/2)}{\pi} = \frac{\pi}{\pi} = 1$$


PE: Why can a distribution function be accepted for representing probability even if its mean and
variance are not defined, but it cannot be accepted if it is not normalized?
4.3 Cumulative Distribution Functions


Cumulative distribution function was defined descriptively in the preamble of this chapter.
Mathematically, it can be defined by the following equation:

$$F(x_i) = P(x \le x_i) \qquad (98)$$

where F is the cumulative distribution function of the random variable x as a function of x_i (which
represents a given value of the random variable). Cumulative distribution function is used to obtain the
probability of the random variable to be in a given range of values (see Problem 1 of the present section
as well as Problem 3 of § 4). As we will see (refer to Eqs. 99 and 100), cumulative distribution function
represents the sum of mass function (for discrete random variables) or the integral of density function (for
continuous random variables) and hence it is normalized to unity (noting that mass function and density
function must be normalized to unity), i.e. we have F(∞) = 1.


Problems 


1. Give some examples of how the probability of the random variable to be in a given range can be 
obtained from the cumulative distribution function. 
Answer: For example (see Eq. 98): 


$$P(x_i < x \le x_j) = F(x_j) - F(x_i)$$
$$P(x > x_i) = 1 - F(x_i)$$


Similar intuitive relations can be easily obtained. 

PE: Give more examples (like the two given in the answer).
2. On which law of probability is the cumulative distribution function ultimately based?

Answer: It is the addition law of probability for mutually exclusive events. 

PE: Justify the given answer (considering both cases of discrete and continuous random variables). 
3. Outline some of the properties of cumulative distribution function F'(x). 

Answer: For example: 

• F(x) is a non-decreasing function of x.
• F(x) is bounded from both sides, i.e. 0 ≤ F(x) ≤ 1 [where F(−∞) = 0 and F(∞) = 1].
• F(x) is a sum of the probability mass function (for discrete random variables) or an integral of the
probability density function (for continuous random variables).

PE: Try to find other properties of cumulative distribution function. 


[122] 


4.3.1 Discrete Random Variables 


For a discrete random variable, the cumulative distribution function is given by:

$$F(x_i) = \sum_{k \le i} P(x_k) \qquad (99)$$

This equation reflects the relationship between the probability mass function P and the cumulative
distribution function F. As we see, we obtain F(x_i) by adding the probabilities of all the values of the random
variable less than or equal to x_i.[123]


Problems 


1. Referring to Problem 5 of § 3.4, create a table representing the cumulative distribution function that
corresponds to Table 4 (which represents a probability mass function).
Answer: See Table 6.
PE: Obtain from Table 6 the following probabilities (with x representing sum):
(a) P(x < 6). (b) P(x > 4). (c) P(x = 7).
(d) P(8 < x < 10). (e) P(5 < x < 8). (f) P(2 < x < 9).


Table 6: The table of Problem 1 of § 4.3.1.
Sum (x)   2     3     4     5      6      7      8      9      10     11     12
F(x)      1/36  3/36  6/36  10/36  15/36  21/36  26/36  30/36  33/36  35/36  36/36


[122] The reader should consider P(x_i < x < x_j), P(x_i ≤ x < x_j), P(x_i ≤ x ≤ x_j) and P(x ≥ x_i) considering the discrete
case only.

[123] We are assuming x_k < x_i when k < i (i.e. x is ordered according to its index where the index refers to the sample
points).
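
The construction of a cumulative table from a mass function (as expressed by Eq. 99) amounts to a running sum. The following C++ code is a minimal sketch of this construction. It assumes that the mass function in question is that of the sum of two fair dice (an assumption suggested by the numerators 1, 3, 6, ..., 36 of Table 6) and it works with integer counts out of 36 rather than with fractions. The function names are hypothetical.

```cpp
#include <array>
#include <cassert>

// Cumulative counts (out of 36) for the sum of two fair dice:
// element s holds the number of outcomes whose sum is at most s.
// Elements 0 and 1 are unused and stay at zero.
std::array<int, 13> cumulative_counts() {
    std::array<int, 13> pmf{};  // pmf[s] = number of outcomes with sum s
    for (int d1 = 1; d1 <= 6; ++d1) {
        for (int d2 = 1; d2 <= 6; ++d2) {
            ++pmf[d1 + d2];
        }
    }
    std::array<int, 13> cdf{};
    int running = 0;
    for (int s = 2; s <= 12; ++s) {
        running += pmf[s];  // Eq. 99: add the probabilities of all sums <= s
        cdf[s] = running;
    }
    return cdf;
}

// Number of outcomes (out of 36) with sum greater than s, i.e. 36 [1 - F(s)].
int count_greater_than(const std::array<int, 13>& cdf, int s) {
    assert(s >= 2 && s <= 12);
    return cdf[12] - cdf[s];
}
```

Dividing the returned counts by 36 gives the values of F(x). For example, the count for x = 7 is 21 and hence F(7) = 21/36, while the count of outcomes with sum greater than 4 is 30 and hence P(x > 4) = 30/36.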
2. Find the following cumulative probabilities of the given mass functions (see Problem 5 of § 4.1):
(a) P(k ≤ 24) where P(k) = 2^{−k} (k = 1, 2, ..., ∞).
(b) P(3 < k ≤ 31) where P(k) = (6/π²) k^{−2} (k = 1, 2, ..., ∞).


Answer:
(a) We use the geometric series formula, that is:

$$P(k \le 24) = \sum_{k=1}^{24} 2^{-k} = \frac{1}{2}\left[\frac{1 - (1/2)^{24}}{1 - (1/2)}\right] = 1 - \frac{1}{16777216} = \frac{16777215}{16777216} \simeq 0.9999999404$$

(b) We have:

$$P(3 < k \le 31) = \sum_{k=4}^{31} \frac{6}{\pi^2 k^2} = \frac{6}{\pi^2} \sum_{k=4}^{31} \frac{1}{k^2} \simeq 0.1532460141$$

PE: Find the following cumulative probabilities of the given mass functions: 
(@)PG<h< 17) where P (kh) = 2-* (K=1,2.25.500), 
(b) P(7 < k < 25) where P(k) = (6/17) k-? (k =1,2,...,00). 
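
The two sums of Problem 2 are easy to evaluate by direct summation. The following C++ code is a minimal sketch of this (with hypothetical function names) in which the limits of the sums are passed as arguments, so that the same functions can be used for other limits.

```cpp
#include <cassert>
#include <cmath>

// Sum of P(k) = 2^{-k} for k = first, ..., last (both included).
double cumulative_half_powers(int first, int last) {
    assert(1 <= first && first <= last);
    double total = 0.0;
    for (int k = first; k <= last; ++k) {
        total += std::ldexp(1.0, -k);  // 1.0 * 2^{-k}
    }
    return total;
}

// Sum of P(k) = (6 / pi^2) k^{-2} for k = first, ..., last (both included).
double cumulative_inverse_squares(int first, int last) {
    assert(1 <= first && first <= last);
    const double pi = std::acos(-1.0);
    double total = 0.0;
    // Summing from the smallest terms upward limits rounding error.
    for (int k = last; k >= first; --k) {
        total += 1.0 / (static_cast<double>(k) * k);
    }
    return 6.0 / (pi * pi) * total;
}
```

For part (a) the limits are 1 and 24, while for part (b) the limits are 4 and 31 (since k = 3 is excluded by the strict inequality).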

3. Write a simple program that calculates the cumulative probabilities of the binomial distribution. 
Answer: See the BinomialCumulative.cpp code. 
PE: Explain how the BinomialCumulative.cpp code calculates these probabilities by using the rules 
of logarithm (despite the fact that the logarithm of a sum is not the sum of its logarithms). 
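
One possible way of combining logarithms with summation is sketched in the following C++ code. This is only an illustration under stated assumptions and is not necessarily the method of the BinomialCumulative.cpp code (which is not reproduced here). Each term is computed through its logarithm using std::lgamma, the largest logarithm is factored out, and the scaled terms (which are then at most 1) are summed in the ordinary way. The function names are hypothetical and the code assumes 0 < p < 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Natural logarithm of the binomial probability P(k) for n trials with
// success probability p.
double log_binomial_pmf(int n, double p, int k) {
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0)
           + k * std::log(p) + (n - k) * std::log(1.0 - p);
}

// Cumulative binomial probability P(k1 <= k <= k2), both limits included.
// The largest term is factored out so that the remaining sum is made of
// terms between 0 and 1, which avoids overflow in the individual terms.
double binomial_cumulative(int n, double p, int k1, int k2) {
    assert(n >= 0 && 0 <= k1 && k1 <= k2 && k2 <= n && p > 0.0 && p < 1.0);
    double largest = log_binomial_pmf(n, p, k1);
    for (int k = k1 + 1; k <= k2; ++k) {
        largest = std::max(largest, log_binomial_pmf(n, p, k));
    }
    double scaled_sum = 0.0;
    for (int k = k1; k <= k2; ++k) {
        scaled_sum += std::exp(log_binomial_pmf(n, p, k) - largest);
    }
    return std::exp(largest) * scaled_sum;
}
```

For example, binomial_cumulative(17, 0.14, 0, 11) corresponds to part (a) of Problem 4. We note that the final multiplication by std::exp(largest) can still underflow to zero in very extreme cases, where the logarithm of the result should be reported instead.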

4. Use the BinomialCumulative.cpp code (of Problem 3) to calculate the following cumulative binomial
probabilities given in the form P(n, p, k1, k2) where both k1 and k2 are included:


(a) P(17,0.14, 0,11). (b) P(59, 0.65, 15, 42). (c) P(299, 0.38, 157, 299). 
(d) P(681, 0.51, 332,375). (e) P(1328, 0.85, 0, 1012). (f£) P(1569, 0.34, 0, 873). 
Answer: Using the BinomialCumulative.cpp code we get: 

(a) P(17,0.14, 0,11) ~ 0.999999824134. (b) P(59, 0.65, 15, 42) ~ 0.872337883771. 
(c) P(299, 0.38, 157, 299) ~ 2.47781558955 x 107”. (d) P(681, 0.51, 332, 375) ~ 0.871942528426. 
(e) P(1328, 0.85, 0, 1012) ~ 2.79892295629 x 1071”. (f£) P(1569, 0.34, 0,873) ~ 1.00000000000. 
PE: Repeat the Problem for the following: 

(a) P(46, 0.44, 0,21). (b) P(428, 0.91, 33, 283). (c) P(810, 0.26, 268, 810). 
(d) P(926, 0.57, 28, 718). (e) P(2689, 0.05, 194, 1962). (f) P(1883, 0.31, 581, 1673). 


5. Write a simple program that calculates the cumulative probabilities of the Poisson distribution. 
Answer: See the PoissonCumulative.cpp code. 
PE: Plot a flowchart representing the algorithm of the PoissonCumulative.cpp code. 

6. Use the PoissonCumulative.cpp code (of Problem 5) to calculate the following cumulative Poisson
probabilities given in the form P(λ, k1, k2) where both k1 and k2 are included or the form P(λ, k > k1):


(a) P(0.95, 0,5). (b) P(215, 14,59). (c) P(33.61, 23, 45). 
(d) P(6.7,6, 81). (e) P(7.82,k > 17). (f) P(2.2,0,59). 
Answer: Using the PoissonCumulative.cpp code we get:


(a) P(0.95, 0,5) ~ 0.999544461196. (b) P(215, 14,59) ~ 1.72509161758 x 10~°°. 
(c) P(33.61, 23,45) ~ 0.953449107032. (d) P(6.7,6,81) ~ 0.659350551358. 

(e) P(7.82,k > 17) ~ 1.24868076898 x 10-3. (f) P(2.2,0,59) ~ 1.00000000000. 

PE: Repeat the Problem for the following: 

(a) P(53.8,k > 61). (b) P(4.8, 0,6). (c) P(231.7, 21, 188). 
(d) P(19.26, 36, 187). (e) P(39.5, 78, 92). (f) P(0.18, 4, 9). 


7. Give formulae for the cumulative distribution of the following discrete probability functions: 
(a) Uniform. (b) Binomial. (c) Poisson. (d) Geometric. (e) Negative binomial. 
Answer: We have (noting that $x \in \mathbb{R}$ and $n$ is a given positive integer):
(a) From Eqs. 99 and 69 we get:

$$F(x) = \begin{cases} 0 & (x < x_1) \\ k/n & (x_k \le x < x_{k+1},\ k = 1, 2, \ldots, n-1) \\ 1 & (x \ge x_n) \end{cases}$$


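(b) From Eqs. 99 and 71 we get:

$$F(x) = \begin{cases} 0 & (x < 0) \\ \sum_{i=0}^{k} C_i^n\, p^i (1-p)^{n-i} & [0 \le x < n,\ k = \mathrm{floor}(x)] \\ 1 & (x \ge n) \end{cases}$$

where floor($x$) is the greatest integer less than or equal to $x$.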
(c) From Eqs. 99 and 75 we get:

$$F(x) = \begin{cases} 0 & (x < 0) \\ \sum_{i=0}^{k} \dfrac{\lambda^i e^{-\lambda}}{i!} & [0 \le x < \infty,\ k = \mathrm{floor}(x)] \end{cases}$$


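(d) From Eqs. 99 and 80 we get:

$$F(x) = \begin{cases} 0 & (x < 1) \\ 1 - (1-p)^k & [1 \le x < \infty,\ k = \mathrm{floor}(x)] \end{cases}$$

We note that $F(x) = 1 - (1-p)^k$ (for $1 \le x < \infty$) because:

$$\begin{aligned}
F(x) &= \sum_{i=1}^{k} (1-p)^{i-1}\, p && [k = \mathrm{floor}(x)] \\
&= p\, \frac{1 - (1-p)^k}{1 - (1-p)} && \text{(geometric series)} \\
&= 1 - (1-p)^k
\end{aligned}$$

Also see Problem 6 of § 4.1.2.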
(e) From Eqs. 99 and 82 we get:

$$F(x) = \begin{cases} 0 & (x < 0) \\ \sum_{i=0}^{k} C_i^{i+r-1}\, p^r (1-p)^i & [0 \le x < \infty,\ k = \mathrm{floor}(x)] \end{cases}$$


PE: Explain and justify (in words) each one of the stated cumulative distributions.

4.3.2 Continuous Random Variables


For a continuous random variable, the cumulative distribution function is given by:

$$F(x) = \int_{-\infty}^{x} f(x)\,dx \tag{100}$$

where $f(x)$ is the probability density function. From this equation we conclude:

$$\frac{dF}{dx} = f(x) \tag{101}$$


These equations (i.e. Eqs. 100 and 101) reflect the relationship between the probability density function
$f$ and the cumulative distribution function $F$. We can also conclude easily (using Eq. 100 as well as the
properties of integration) that:

$$\begin{aligned}
F(x_1 \le x \le x_2) = F(x_2) - F(x_1) &= \int_{-\infty}^{x_2} f(x)\,dx - \int_{-\infty}^{x_1} f(x)\,dx \\
&= \int_{-\infty}^{x_2} f(x)\,dx + \int_{x_1}^{-\infty} f(x)\,dx = \int_{x_1}^{x_2} f(x)\,dx
\end{aligned} \tag{102}$$


It is worth noting that whether we use strict inequality (i.e. $<$) or non-strict inequality (i.e. $\le$) in
determining the limits of the cumulative probability distribution is important in the case of discrete 
distributions but not in the case of continuous distributions. This is because including and excluding 
a single point (i.e. a limit point) is important for the discrete distributions but not for the continuous 
distributions. The reason is that the values of probability in the discrete distributions are attached 
to individual points and hence a single point has a finite probability value and thus its inclusion and 
exclusion (i.e. in a sum) affect the cumulative probability. On the other hand, the values of probability in 
the continuous distributions are attached to areas under a curve (rather than individual points) and hence 
a single point has no finite probability value and thus its inclusion and exclusion (i.e. in an integral) have 
no effect on the cumulative probability. For example, $F(x_i < x \le x_j)$ and $F(x_i \le x \le x_j)$ are not equal
(in general) if $F$ is a discrete cumulative distribution but they are equal if $F$ is a continuous cumulative
distribution. This is because in the former case the probability of $x_i$ is finite and hence the inclusion
and exclusion of $x_i$ in the sum of probabilities has an effect on the cumulative probability, while in the
latter case the inclusion and exclusion of $x_i$ has no effect on the cumulative probability because regardless
of whether we include or exclude $x_i$ the cumulative probability is determined by the same integral, i.e.
$\int_{x_i}^{x_j} f(x)\,dx$.
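As a concrete illustration of this point (an example added here for clarity, not one taken from the text), consider a fair die with $p_i = 1/6$ for $i = 1, \ldots, 6$ alongside a continuous uniform density $f(x) = 1/6$ on $0 \le x \le 6$:

```latex
\begin{aligned}
\text{discrete (fair die):}\quad
  & F(2 < x \le 4) = p_3 + p_4 = \tfrac{2}{6},
  \qquad F(2 \le x \le 4) = p_2 + p_3 + p_4 = \tfrac{3}{6} \\
\text{continuous } \left[f(x) = \tfrac{1}{6}\right]:\quad
  & F(2 < x \le 4) = F(2 \le x \le 4) = \int_{2}^{4} \tfrac{1}{6}\,dx = \tfrac{2}{6}
\end{aligned}
```

In the discrete case the single point $x = 2$ carries probability $1/6$, whereas in the continuous case it contributes nothing to the integral.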


Problems 


1. Referring to the above statement: “On the other hand, the values of probability ... have no effect on
the cumulative probability”, it may be claimed that there is a contradiction between having a finite
value of probability density for a given point $x_i$ [since $f(x_i)$ is not zero in general] and the fact that
the inclusion and exclusion of that point (i.e. $x_i$) in the integral have no effect on the cumulative
probability. Discuss this issue in detail.

Answer: We note the following: 

• As indicated above, the nature of discrete and continuous probability (as expressed and represented
by the distribution function) is different. This can be seen clearly in the normalization conditions of
these probabilities, i.e.

$$\sum_i p_i = 1 \quad \text{(discrete)} \qquad\qquad \int_{-\infty}^{+\infty} f(x)\,dx = 1 \quad \text{(continuous)}$$

As we see, the continuous probability is not defined as $f(x)$ but as $f(x)\,dx$ (since it is an area under
a curve) and that is why the normalization condition is not $\sum_i f(x_i) = 1$ as in the case of discrete
probability. Accordingly, if $dx \to 0$ (i.e. by choosing a single point) then $f(x_i)\,dx \to 0$ even though
$f(x_i)$ is finite. So in brief, $f(x_i)$ is not the probability of $x_i$ but it is the probability density (or rate) at
$x_i$ from which we obtain the probability of $x_i$ (within an infinitesimal interval $dx$ in the neighborhood
of $x_i$) by multiplying $f(x_i)$ by $dx$.
• In our view, this can be seen as a demonstration of the uncertainty principle (which is commonly
associated with quantum physics although it has many applications and instances in many other
fields such as mathematics and science in general).[124] This means that obtaining an exact value $x_i$
(e.g. by observation or measurement) with no uncertainty (represented by $dx$ or $\Delta x$) is impossible,
at least practically. However, our view may be challenged by the question: why do we not have
such an uncertainty in the discrete case (if uncertainty is supposedly inherent to our observation and
measurement)? But we can reply by noting that certain discrete variables do not have this type
of uncertainty (e.g. integers representing the number of people or objects). Moreover, having an
uncertainty in the discrete case (i.e. in some instances of discrete variables) is less obvious and has
almost no practical consequences because the individual values of $x_i$ are usually separated by large
gaps (compared to the value of uncertainty $dx$ or $\Delta x$) and hence the uncertainty on both sides of $x_i$
(by a tiny margin $dx$ or $\Delta x$) can be ignored because this usually has no theoretical or practical effect,
although we should always keep in mind that there is an uncertainty even in (some instances of) the
discrete case (e.g. discrete voltages in digital electronic devices).
PE: Try to expand the above answer by adding more discussion and arguments. 

2. Find the following cumulative probabilities of the given density functions (see Problem 5 of § 4.2):
(a) $F(0.52 \le x \le 3.5)$ where $f(x) = 2e^{-2x}$ $(0 \le x < \infty)$.
(b) $F(-0.45 \le x \le 0.69)$ where $f(x) = (3/2)\,x^2$ $(-1 \le x \le +1)$.
Answer:

(a)

$$F(0.52 \le x \le 3.5) = \int_{0.52}^{3.5} 2e^{-2x}\,dx = \left[-e^{-2x}\right]_{0.52}^{3.5} = -e^{-7} + e^{-1.04} \approx 0.35254279999$$

(b)

$$F(-0.45 \le x \le 0.69) = \int_{-0.45}^{0.69} \frac{3}{2}\,x^2\,dx = \left[\frac{x^3}{2}\right]_{-0.45}^{+0.69} = \frac{0.328509 + 0.091125}{2} = 0.209817$$


PE: Find the following cumulative probabilities of the given density functions: 
(a) $F(1.37 \le x \le 2.46)$ where $f(x) = 0.125x + 0.25$ $(1 \le x \le 3)$.


(b) F(3.9 < a < 6.8) where f(x) = qBOs (1< a2 <7). 
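The two integrals in the answer to Problem 2 can also be cross-checked numerically. The sketch below is an illustration only: it uses composite Simpson's rule, which is a choice made here and not a method stated in the text, and the names `simpson`, `densityA` and `densityB` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Composite Simpson's rule for the integral of f over [a, b] using n
// subintervals (n must be even and positive).
double simpson(const std::function<double(double)>& f, double a, double b, int n) {
    assert(n > 0 && n % 2 == 0);
    const double h = (b - a) / n;
    double sum = f(a) + f(b);
    for (int i = 1; i < n; ++i) {
        // Interior points alternate between weight 4 (odd) and weight 2 (even).
        sum += (i % 2 == 1 ? 4.0 : 2.0) * f(a + i * h);
    }
    return sum * h / 3.0;
}

// Density of part (a): f(x) = 2 e^{-2x} on 0 <= x < infinity.
double densityA(double x) { return 2.0 * std::exp(-2.0 * x); }

// Density of part (b): f(x) = (3/2) x^2 on -1 <= x <= +1.
double densityB(double x) { return 1.5 * x * x; }
```

With these definitions, `simpson(densityA, 0.52, 3.5, 1000)` and `simpson(densityB, -0.45, 0.69, 1000)` should reproduce the analytical results of parts (a) and (b) to many digits (Simpson's rule is exact for the quadratic density of part b, up to rounding).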


3. Write a simple program that calculates the cumulative probabilities of the normal distribution. 
Answer: See the NormalCumulative.cpp code. 
PE: Write down the mathematical formulae used in the calculations performed in the
NormalCumulative.cpp code.

4. Use the NormalCumulative.cpp code (of Problem 3) to calculate the following cumulative normal
probabilities given in the form $F(\mu, V, x_1, x_2)$ where $x_1 < x_2$ (noting that $x_1$ can be $-\infty$ and $x_2$ can be $+\infty$):


(a) F(12.1, 9, −∞, 17.3). (b) F(−11.5, 5.76, −3.7, +∞). (c) F(92.7, 6.6, 33.5, 88.4).
(d) F(0, 39.8, −∞, 40.1). (e) F(19.2, 63, 18.1, 18.3). (f) F(217, 71.3, 274, +∞).
Answer: Using the NormalCumulative.cpp code we get: 

(a) F(12.1, 9, −∞, 17.3) ≈ 0.95848178031. (b) F(−11.5, 5.76, −3.7, +∞) ≈ 0.00057702504.
(c) F(92.7, 6.6, 33.5, 88.4) ≈ 0.04708763686. (d) F(0, 39.8, −∞, 40.1) ≈ 0.99999999990.
(e) F(19.2, 63, 18.1, 18.3) ≈ 0.00997267574. (f) F(217, 71.3, 274, +∞) ≈ $7.37143679430 \times 10^{-12}$.


[124] The reader is referred to our book “The Epistemology of Quantum Physics”. 




PE: Repeat the Problem for the following:
(a) F(1.78, 2.4, −∞, 0.51). (b) F(−14.9, 0.9, −15, −13.2). (c) F(37.2, 56.4, 41.4, +∞).
(d) F(67.8, 2.5, 50, 55.3). (e) F(−9.3, 3.8, −11.6, +∞). (f) F(123.8, 25, −∞, 99.2).
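The NormalCumulative.cpp code is not listed in this section. One possible way to compute probabilities of the form $F(\mu, V, x_1, x_2)$ is sketched below, assuming the standard library's complementary error function `std::erfc`; the names `upperTail`, `normalCumulative` and `kInf` are hypothetical. The sketch works with whichever tail keeps the subtraction away from the form $1 - (1 - \text{tiny})$, which matters for far-tail cases such as part (f) of the Problem.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Convenience constant for the infinite limits allowed in F(mu, V, x1, x2).
const double kInf = std::numeric_limits<double>::infinity();

// Upper tail of the standard normal distribution: Q(z) = erfc(z / sqrt(2)) / 2.
double upperTail(double z) {
    return 0.5 * std::erfc(z / std::sqrt(2.0));
}

// F(mu, V, x1, x2): probability that a normal variable with mean mu and
// variance V lies between x1 and x2 (x1 may be -kInf, x2 may be +kInf).
double normalCumulative(double mu, double V, double x1, double x2) {
    assert(V > 0.0 && x1 <= x2);
    const double sigma = std::sqrt(V);
    const double z1 = (x1 - mu) / sigma;
    const double z2 = (x2 - mu) / sigma;
    if (z1 > 0.0) {
        // Both limits lie above the mean: subtract two small upper tails.
        return upperTail(z1) - upperTail(z2);
    }
    // Otherwise use lower tails, noting that Phi(z) = Q(-z).
    return upperTail(-z2) - upperTail(-z1);
}
```

For example, `normalCumulative(12.1, 9, -kInf, 17.3)` corresponds to part (a) of Problem 4 and `normalCumulative(217, 71.3, 274, kInf)` to part (f).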

5. The lengths $L$ of nails produced in a factory are normally distributed with mean $\mu = 3$ cm and variance
$V = 0.2$ cm². Find the percentage of nails whose length is less than 2.7 cm or greater than 3.3 cm.
Answer: The cumulative probability $F(\mu, V, L_1 \le L \le L_2)$ is given by (noting that $\mu = 3$, $V = 0.2$,
$L_1 = 2.7$ and $L_2 = 3.3$):

$$F(3, 0.2, 2.7 \le L \le 3.3) = \frac{1}{\sqrt{0.4\pi}} \int_{2.7}^{3.3} e^{-\frac{(x-3)^2}{0.4}}\,dx \approx 0.497665045639$$

Accordingly, the probability of the length being less than 2.7 cm or greater than 3.3 cm is:

$$1 - F(3, 0.2, 2.7 \le L \le 3.3) \approx 0.502334954361$$


which means that about 50.2% of the nails produced have lengths less than 2.7cm or greater than 
3.3cm. 
PE: Find the percentage of nails whose length is greater than 2.8cm. 
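The integral in the answer to Problem 5 can also be read off after standardizing the variable. In the short derivation below, $\Phi$ denotes the standard normal cumulative distribution function; this symbol is introduced here for convenience and is not notation used in the text.

```latex
z = \frac{L_2 - \mu}{\sqrt{V}} = \frac{3.3 - 3}{\sqrt{0.2}} \approx 0.6708,
\qquad
F(3, 0.2, 2.7 \le L \le 3.3) = \Phi(z) - \Phi(-z) = 2\,\Phi(z) - 1
  = \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right) \approx 0.4977
```

The symmetric limits ($\mu \pm 0.3$) are what allow the result to be written with a single error function.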

6. Give an example showing that the normal distribution is generally more convenient in calculation than 
the corresponding binomial distribution and hence justify the use of the normal as an approximation 
to the binomial (when the approximation of the binomial by the normal is valid). 

Answer: For example, let us have a binomial probability problem where we have to calculate the (inner)
cumulative probability $P(n, p, k_1 \le k \le k_2) = P(692, 0.47, 296 \le k \le 317)$. Now, the binomial
cumulative probability is:

$$P(692, 0.47, 296 \le k \le 317) = \sum_{k=296}^{317} C_k^{692}\; 0.47^k\; 0.53^{692-k} \approx 0.26631785039$$


On the other hand, the corresponding normal cumulative probability $F(\mu, V, 296 \le x \le 317)$ is [noting
that $\mu = np = 325.24$ and $V = np(1-p) = 172.3772$]:

$$F(325.24, 172.3772, 296 \le x \le 317) = \frac{1}{\sqrt{344.7544\pi}} \int_{296}^{317} e^{-\frac{(x - 325.24)^2}{344.7544}}\,dx \approx 0.25216025443$$

As we see, the normal cumulative probability requires far less calculation than the corresponding
binomial cumulative probability (noting that the binomial is the sum of 22 terms involving many large
factorials while the normal is a single simple integral) and the results are reasonably close (i.e. the
difference is tolerable for most practical purposes).
PE: Repeat the Problem by giving another example. 
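The comparison made in the answer to Problem 6 can be reproduced with a few lines of code. The following is a hedged sketch, not the book's own code: the exact binomial sum is evaluated term by term in log space, while the normal counterpart is written in closed form through `std::erf` with $\mu = np$ and $V = np(1-p)$ and without any continuity correction (as in the text). The function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Exact inner cumulative binomial probability P(n, p, k1 <= k <= k2),
// summed term by term in log space. Assumes 0 < p < 1.
double binomialInner(int n, double p, int k1, int k2) {
    assert(p > 0.0 && p < 1.0 && 0 <= k1 && k1 <= k2 && k2 <= n);
    double sum = 0.0;
    for (int k = k1; k <= k2; ++k) {
        double logC = std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
        sum += std::exp(logC + k * std::log(p) + (n - k) * std::log1p(-p));
    }
    return sum;
}

// Normal counterpart F(mu, V, x1 <= x <= x2) with mu = n p and V = n p (1 - p),
// evaluated through the error function.
double normalInner(int n, double p, double x1, double x2) {
    const double mu = n * p;
    const double V = n * p * (1.0 - p);
    const double s = std::sqrt(2.0 * V);
    return 0.5 * (std::erf((x2 - mu) / s) - std::erf((x1 - mu) / s));
}
```

Calling `binomialInner(692, 0.47, 296, 317)` requires a loop over 22 terms, whereas `normalInner(692, 0.47, 296, 317)` needs only two error-function evaluations, which is the convenience the Problem refers to.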
7. Give formulae for the cumulative distribution of the following continuous probability functions: 


(a) Uniform. (b) Normal. (c) Exponential. (d) Gamma. (e) Cauchy. 


Answer: We have (noting that $x \in \mathbb{R}$):
(a) From Eqs. 100 and 89 we get:

$$F(x) = \begin{cases} 0 & (x < a) \\ \dfrac{x - a}{b - a} & (a \le x < b) \\ 1 & (x \ge b) \end{cases}$$

(b) From Eqs. 100 and 91 we get (noting that erf is the error function):

$$F(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x - \mu}{\sqrt{2V}}\right)\right] \qquad (-\infty < x < \infty)$$

(c) From Eqs. 100 and 93 we get:

$$F(x) = \begin{cases} 0 & (x < 0) \\ 1 - e^{-\lambda x} & (0 \le x < \infty) \end{cases}$$

(d) From Eqs. 100 and 95 we get:

$$F(x) = \begin{cases} 0 & (x < 0) \\ \dfrac{\Gamma(r, \lambda x)}{(r-1)!} & (0 \le x < \infty) \end{cases}$$

where $\Gamma$ here is the lower incomplete gamma function.
(e) From Eqs. 100 and 97 we get:

$$F(x) = \frac{\arctan(x)}{\pi} + \frac{1}{2} \qquad (-\infty < x < \infty)$$

PE: Justify each one of the stated cumulative distributions by showing how to obtain the cumula- 
tive distribution from Eq. 100 in association with the probability density function of the particular 
distribution. 


4.4 Multivariate Probability Functions 


The previous investigations were largely about probability functions of a single random variable. However, 
two or more associated (or simultaneous) random variables can have a joint probability function that 
represents their interdependent distribution. These random variables could be discrete or continuous or 
mixed (i.e. some discrete and some continuous) where each one of these variables can be subject to 
a certain type of discrete or continuous function (such as those investigated earlier) depending on the 
nature of that variable. Moreover, these variables could be dependent or independent of each other (or 
they have mixed dependency). We can also have cumulative distribution functions for these multivariate 
joint probability functions (as well as ordinary distribution functions, i.e. mass and density functions). 

Most of the previous probability principles and rules (and even some proofs and arguments) generally 
apply (with proper adaptations and modifications) to these joint probability functions. For example, a 
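joint probability (mass) function of two discrete random variables $P(x, y)$ [where $x = x_1, x_2, \ldots, x_i, \ldots, x_m$
and $y = y_1, y_2, \ldots, y_j, \ldots, y_n$] satisfies the following properties:[125]

$$P(x, y) = \begin{cases} P(x_i, y_j) & (x = x_i,\ y = y_j) \\ 0 & \text{(otherwise)} \end{cases} \qquad \text{(definition)} \tag{103}$$

$$P(x, y) \ge 0 \qquad \text{(probability} \ge 0\text{)} \tag{104}$$

$$\sum_{i=1}^{m} \sum_{j=1}^{n} P(x_i, y_j) = 1 \qquad \text{(normalization)} \tag{105}$$

$$F(x_i, y_j) = \sum_{k=1}^{i} \sum_{l=1}^{j} P(x_k, y_l) \qquad \text{(cumulativity)} \tag{106}$$

$$F(a < x \le b,\ c < y \le d) = F(a, c) + F(b, d) - F(a, d) - F(b, c) \qquad \text{(inner cumulativity)} \tag{107}$$

$$P(x, y) = P_x(x)\, P_y(y) \qquad \text{(if } x, y \text{ are independent)} \tag{108}$$

where $P_x$ and $P_y$ are the probability mass functions of $x$ and $y$ respectively.[126]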


[125] Functions of two random variables are labeled as bivariate probability functions. We also note that $m$ and/or $n$
can be infinite.

[126] The equation $P(x, y) = P_x(x)\, P_y(y)$ represents the fact that two discrete random variables $x, y$ are independent iff their
joint probability mass function $P(x, y)$ can be expressed as a product of a probability mass function of $x$ only $P_x(x)$
times a probability mass function of $y$ only $P_y(y)$.

Similarly, a joint probability (density) function of two continuous random variables $f(x, y)$ satisfies the
following properties:

$$f(x, y)\,dx\,dy = P(x_i \le x \le x_i + dx,\ y_j \le y \le y_j + dy) \qquad \text{(definition)} \tag{109}$$

$$f(x, y) \ge 0 \qquad \text{(probability} \ge 0\text{)} \tag{110}$$

$$\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x, y)\,dx\,dy = 1 \qquad \text{(normalization)} \tag{111}$$

$$F(x_i, y_j) = \int_{-\infty}^{y_j} \int_{-\infty}^{x_i} f(x, y)\,dx\,dy \qquad \text{(cumulativity)} \tag{112}$$

$$F(a \le x \le b,\ c \le y \le d) = \int_{c}^{d} \int_{a}^{b} f(x, y)\,dx\,dy \qquad \text{(inner cumulativity)} \tag{113}$$

$$f(x, y) = f_x(x)\, f_y(y) \qquad \text{(if } x, y \text{ are independent)} \tag{114}$$

where $f_x$ and $f_y$ are the probability density functions of $x$ and $y$ respectively.[127]


Problems


1. Give some examples of discrete bivariate probability mass functions.

Answer:

• Discrete bivariate uniform-uniform:

$$P(x_i, y_j) = \begin{cases} \dfrac{1}{mn} & (i = 1, 2, \ldots, m \text{ and } j = 1, 2, \ldots, n) \\ 0 & \text{(otherwise)} \end{cases}$$
• The distribution represented by the following table:[128]

          |  y = 1  |  y = 2  |  y = 3  |  y = 4  |  y = 5
    x = 1 | 33/784  | 21/784  | 67/784  |  2/784  | 78/784
    x = 2 |  7/784  | 13/784  | 29/784  |  6/784  | 55/784
    x = 3 | 42/784  | 63/784  |  5/784  | 21/784  | 48/784
    x = 4 | 39/784  | 19/784  | 90/784  | 61/784  | 85/784

PE: Show that the given examples are normalized. Also give more examples of discrete bivariate
probability mass functions.
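The normalization of the tabulated distribution can be checked mechanically by summing the numerators of the table. The sketch below is only an illustration of that check; the identifiers `kNumerators`, `kDenominator`, `totalNumerator` and `marginalX` are hypothetical, and the row sums it computes are what footnote-style terminology elsewhere in the book calls marginal probabilities of $x$.

```cpp
#include <array>
#include <cassert>

// Numerators of the joint probabilities P(x = i, y = j) from the table
// (common denominator 784); rows are x = 1..4 and columns are y = 1..5.
constexpr std::array<std::array<int, 5>, 4> kNumerators = {{
    {33, 21, 67, 2, 78},
    {7, 13, 29, 6, 55},
    {42, 63, 5, 21, 48},
    {39, 19, 90, 61, 85},
}};
constexpr int kDenominator = 784;

// Sum of all numerators; the table is normalized when this equals 784.
int totalNumerator() {
    int total = 0;
    for (const auto& row : kNumerators) {
        for (int v : row) total += v;
    }
    return total;
}

// Probability that x = i (for i = 1..4), obtained by summing row i over y.
double marginalX(int i) {
    assert(i >= 1 && i <= 4);
    int rowSum = 0;
    for (int v : kNumerators[i - 1]) rowSum += v;
    return static_cast<double>(rowSum) / kDenominator;
}
```

Here `totalNumerator()` returns 784 (the four row sums being 201, 110, 179 and 294), which confirms that the table satisfies Eq. 105.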


2. Give some examples of continuous bivariate probability density functions. 


Answer: 


• Continuous bivariate uniform-uniform:

$$f(x, y) = \begin{cases} \dfrac{1}{(b-a)(d-c)} & (a \le x \le b,\ c \le y \le d) \\ 0 & \text{(otherwise)} \end{cases}$$

• Continuous bivariate normal-normal:

$$f(x, y) = \frac{1}{2\pi}\, e^{-(x^2 + y^2)/2} \qquad (-\infty < x < \infty,\ -\infty < y < \infty)$$

PE: Show that the given examples are normalized. Also give more examples of continuous bivariate
probability density functions.


[127] The equation $f(x, y) = f_x(x)\, f_y(y)$ represents the fact that two continuous random variables $x, y$ are independent iff
their joint probability density function $f(x, y)$ can be expressed as a product of a probability density function of $x$ only
$f_x(x)$ times a probability density function of $y$ only $f_y(y)$.

[128] We note that $x$ and $y$ in this table represent the random variables while the entries in this table represent the probabilities
of the pairs $(x = i, y = j)$ where $i = 1, 2, 3, 4$ and $j = 1, 2, 3, 4, 5$.


Chapter 5 
Statistical Indicators 


In this chapter we present a brief discussion of some well known and widely used statistical indicators 
which are usually met in the literature and textbooks of probability theory and its applications, and 
hence awareness and basic understanding of these indicators represent a necessity to those interested in 
probability theory and its applications (noting that these indicators are also commonly used and met in 
other branches of mathematics and science and hence their necessity and usefulness are more general). 
These statistical indicators are used to summarize statistical data sets and present them in a compact 
form that demonstrates their properties and general features. Hence, these indicators are very useful in 
the comprehension and interpretation of data sets and the appreciation of their significance. 


5.1 Mean 


The mean (which may also be called the expectation or expected value although some people distinguish 
between the two) represents the average value (per trial) expected (for a random variable) in a large 
number of trials of a given random experiment. For example, if we throw a fair coin many times then 
we should expect to get H in about half of these throws and T in the other half, and hence if we assign 
a numerical value of 0 to H and a numerical value of 1 to T then we should expect to get on average a 
numerical value of 1/2 per throw. 
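The coin example can be mimicked by a small simulation. The sketch below is an illustration added here, not code from the book: it assumes the standard `<random>` facilities with a fixed seed so that a run is repeatable, and the function name `averageCoinValue` is hypothetical.

```cpp
#include <cassert>
#include <random>

// Average numerical value per throw of a fair coin (H -> 0, T -> 1) over a
// given number of simulated throws; the seed makes the run repeatable.
double averageCoinValue(int throws, unsigned seed) {
    assert(throws > 0);
    std::mt19937 gen(seed);
    std::bernoulli_distribution tail(0.5);  // true means T, false means H
    long long total = 0;
    for (int i = 0; i < throws; ++i) {
        total += tail(gen) ? 1 : 0;
    }
    return static_cast<double>(total) / throws;
}
```

For a large number of throws (say one million) the returned average should lie close to 1/2, although the exact value depends on the seed and on the library's implementation of the distribution.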

However, before we go through the details of this investigation we should draw the attention to the
notation that we use to represent the mean. In fact, we use $\mu$ to represent the mean but in two slightly
different forms. So, we use bare $\mu$ or with a subscript (e.g. $\mu_x$ which means the mean of the random
variable $x$) to represent the mean as a specific number while we use $\mu$ with parentheses to represent the
operation of taking the mean of the variable inside the parentheses and hence it is like a function. For
instance, $\mu_x$ means the number 6.25 (as an example) while $\mu(x)$ or $\mu[x]$ means the operation of taking the
mean of the variable $x$. In fact, this notation applies even to the variance (which will be investigated in §
5.2) and hence we use, for instance, $V_x$ and $V(x)$ in the same capacity as $\mu_x$ and $\mu(x)$.[129]

For a discrete real-valued random variable x (which can take distinct discrete values x; with corre- 
sponding probabilities p;), the mean is given by the sum: 


$$\mu(x) = \frac{1}{n} \sum_i n_i x_i = \sum_i \frac{n_i}{n}\, x_i = \sum_i p_i x_i \tag{115}$$

where $n$ is the total number of trials, $x_i$ is the $i$th value of the random variable $x$, $n_i$ is the number of trials
in which $x_i$ is obtained, $p_i = P(x_i)$ is the probability of $x_i$ (i.e. the value of the probability mass function
corresponding to $x_i$), and $i$ runs over all possible distinct values of $x$ (noting that $\sum_i n_i = n$ which means
that $i$ runs over $n_i$ not over $n$).

For a continuous real-valued random variable $x$ (which can take continuous values between $-\infty < x < \infty$),
the mean is given by the integral:

$$\mu(x) = \int_{-\infty}^{+\infty} x\, f(x)\,dx \tag{116}$$

where $f(x)$ is the probability density function of $x$.
It is worth noting that the mean of a function of $x$ follows the above style, that is:

$$\mu[g(x)] = \sum_i g(x_i)\, P(x_i) \qquad \text{(discrete variable)} \tag{117}$$

$$\mu[g(x)] = \int_{-\infty}^{+\infty} g(x)\, f(x)\,dx \qquad \text{(continuous variable)} \tag{118}$$

where $g$ is a function of the random variable $x$.

[129] This similarly applies to the standard deviation where we use $\sigma$ and $\sigma(x)$.

The mean satisfies the following properties (with $x, y$ being random variables defined over the same
sample space):

$$\mu(C) = C \qquad (C \text{ is constant} \in \mathbb{R}) \tag{119}$$

$$\mu(x + y) = \mu(x) + \mu(y) \tag{120}$$

$$\mu(ax) = a\,\mu(x) \qquad (a \text{ is constant} \in \mathbb{R}) \tag{121}$$

$$\mu(ax + C) = a\,\mu(x) + C \qquad (a, C \text{ are constants} \in \mathbb{R}) \tag{122}$$

$$\mu(xy) = \mu(x)\,\mu(y) \qquad (x \text{ and } y \text{ are independent}) \tag{123}$$

$$\mu[g(x)] = \mu[g_1(x)] + \mu[g_2(x)] \qquad [g(x) = g_1(x) + g_2(x)] \tag{124}$$


These equations can be generalized to more than one variable. For example, Eq. 120 can be extended to 
more than two variables by repetitive application (see Problem 3). It should be obvious that for a random 
variable to have a mean, the sum of Eq. 115 (in the case of discrete) and the integral of Eq. 116 (in the 
case of continuous) should converge. 
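For a discrete variable with finitely many values, the sums of Eqs. 115 and 117 translate directly into a short loop. The sketch below is an illustration under that assumption (finite support, probabilities supplied explicitly); the function name `discreteMean` is hypothetical and is not taken from the book's codes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Mean of g(x) for a discrete variable: the sum of g(x_i) * p_i (Eq. 117).
// With g(x) = x this reduces to the ordinary mean of Eq. 115.
double discreteMean(const std::vector<double>& x, const std::vector<double>& p,
                    const std::function<double(double)>& g) {
    assert(x.size() == p.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += g(x[i]) * p[i];
    }
    return sum;
}
```

As a usage example, for a fair die (values 1 to 6, each with probability 1/6) the call with $g(x) = x$ gives 3.5, and the call with $g(x) = 2x + 1$ gives 8, in agreement with Eq. 122 since $2 \times 3.5 + 1 = 8$.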

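For a bivariate probability function (see § 4.4) of random variables $x$ and $y$ the mean of $x$ is given by:

$$\mu(x) = \sum_i \sum_j x_i\, P(x_i, y_j) \qquad \text{(discrete variables)} \tag{125}$$

$$\mu(x) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x\, f(x, y)\,dx\,dy \qquad \text{(continuous variables)} \tag{126}$$

Similar definitions apply to the mean of the variable $y$, i.e. $\mu(y)$. We may also consider the mean of a
bivariate function $g(x, y)$ and hence we have:

$$\mu[g(x, y)] = \sum_i \sum_j g(x_i, y_j)\, P(x_i, y_j) \qquad \text{(discrete variables)} \tag{127}$$

$$\mu[g(x, y)] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g(x, y)\, f(x, y)\,dx\,dy \qquad \text{(continuous variables)} \tag{128}$$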


The covariance of two random variables $x$ and $y$ is defined as:

$$\mathrm{Cov}(x, y) = \mu[(x - \mu_x)(y - \mu_y)] \tag{129}$$

where $\mu_x = \mu(x)$ and $\mu_y = \mu(y)$. The correlation (or correlation coefficient) of two random variables
$x$ and $y$ is defined as:

$$\mathrm{Cor}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{V_x V_y}} \tag{130}$$

where $V_x = V(x) \ne 0$ and $V_y = V(y) \ne 0$. As we see, if $\mathrm{Cov}(x, y) = 0$ then $\mathrm{Cor}(x, y) = 0$ and the random
variables $x, y$ are then described as uncorrelated (otherwise they are described as correlated). Noting that
the covariance of independent random variables is zero (see Problem 7), independent variables are always
uncorrelated.

There are a few important points to note about the mean:

• The above definition of the mean (as given in the first paragraph of this section) is basic and lacks
rigor, but it should be sufficient for our purpose. Moreover, conceptually (and even in some applications
and according to some conventions) the mean is not the same as the expectation value, although for
simplicity we prefer to treat them as identical. The readers are referred to the literature about the details
of these issues and the difference in definitions and conventions related to them (noting that our book is
introductory and hence these issues are out of scope).

• The mean can be sensibly attributed to the random variable, and hence we talk about the mean of a
given random variable $x$. It can also be sensibly attributed to the probability distribution of the random
variable, and hence we can talk about the mean of the mass function $P(x)$ or the density function $f(x)$
of a given random variable $x$. This should make sense by inspecting Eqs. 115 and 116 which involve
both the random variable and its probability distribution. However, the “mean” primarily belongs to the
random variable, and this should be evident from the meaning of “mean” as well as from its notation, e.g.
$\mu(x)$ or $\mu_x$.

• The mean can be seen as a functional reflecting the characteristics of the distribution function of its
random variable. In fact, this should add more justification to our observation in the previous point that
the mean can be sensibly attributed to the probability distribution of the random variable (as well as to
the random variable itself).

• The purpose of our convention about the notation of mean and variance (i.e. $\mu$ or $V$ with and without
parentheses) is to avoid awful notations commonly used in probability and statistics which cause confusion
and unnecessary complications (as well as extra effort in writing and typing). However, it should be noted
that the difference between our two notations is not always clear-cut although this should not affect the
rigor or comprehensibility in general.

• “Mean” in this book means arithmetic mean (noting that there are other types of mean like geometric
mean and harmonic mean; see Problem 8).

• In the equations that involve more than one random variable (e.g. Eqs. 120 and 127) we generally
assume that the random variables are defined over the same sample space.

• The mean of a given probability distribution may not exist and hence the distribution then is classified
as pathological (see the Cauchy distribution in § 4.2.4). However, if the mean does exist then it (unlike
some other statistical indicators) is unique (i.e. with respect to a given variable).

• The mean of a distribution that is symmetric with respect to $x = c$ (where $x$ is a random variable and $c$ is a
constant) is $\mu(x) = c$ and hence no calculation is needed in this case, i.e. we should inspect the distribution
(whose mean we want to find) for a potential symmetry before deciding if we have to go through the
calculation of its mean (see Problem 12).

• The mean has the same physical dimensions as its random variable.

• The mean has the same sign as its random variable (if the random variable has a fixed sign). So, if the
random variable takes only non-negative/non-positive values then its mean will also be
non-negative/non-positive. This can be inferred from Eqs. 115 and 116 noting that $p_i$ (in Eq. 115) and $f$ (in Eq. 116) are
non-negative.


Problems 
1. Find the mean of the (discrete) data of: 
(a) The following list of values: {6, 6, 9.4, 10.7, 10.7, 10.7, 10.7, 15.6, 15.8, 19, 19, 21.7}. 


(b) Problem 5 of § 3.4 (see Table 4). 
Answer: From Eq. 115 we have:

(a)

$$\mu = \frac{1}{12}\left[(2 \times 6) + 9.4 + (4 \times 10.7) + 15.6 + 15.8 + (2 \times 19) + 21.7\right] \approx 12.94$$

(b)

$$\mu = \frac{1}{36} \times 2 + \frac{2}{36} \times 3 + \cdots + \frac{1}{36} \times 12 = 7$$


PE: Find the mean of the following data set: {2.3, 2.5, 5, 4.2, 5, 5, 7.1, 6.3, 7.1, 7.1, 8.4, 7.1, 8.9, 9.6, 
11.5, 11.5, 7.6, 11.5, 17.4, 17.9}. 
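For a raw data list such as those above, the mean can be computed by summing all entries and dividing by their number, which is equivalent to weighting each distinct value $x_i$ by $n_i/n$ as in Eq. 115. The following is a minimal sketch of that calculation; the function name `dataMean` is illustrative.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a raw data list: every occurrence of a value counts
// once, which is equivalent to weighting each distinct value x_i by n_i / n.
double dataMean(const std::vector<double>& data) {
    assert(!data.empty());
    return std::accumulate(data.begin(), data.end(), 0.0) / data.size();
}
```

Applied to the twelve values of part (a) of Problem 1, the function returns 155.3/12, i.e. about 12.94.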
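2. Find the mean of the (continuous) data represented by the following probability density functions:

(a) $f(x) = \frac{x}{8}$ $(0 \le x \le 4)$, $f(x) = 0$ (otherwise). (b) $f(x) = \frac{\mathrm{sech}(x - \pi)}{\pi}$ $(-\infty < x < +\infty)$.

Answer: From Eq. 116 we have:

(a)

$$\mu(x) = \int_{-\infty}^{+\infty} x\, f(x)\,dx = \int_{0}^{4} \frac{x^2}{8}\,dx = \frac{8}{3}$$

(b)

$$\mu(x) = \int_{-\infty}^{+\infty} x\, f(x)\,dx = \int_{-\infty}^{+\infty} x\, \frac{\mathrm{sech}(x - \pi)}{\pi}\,dx = \pi$$

This result can also be obtained from the symmetry of $\mathrm{sech}(x - \pi)$ with respect to $x = \pi$.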
PE: Repeat the Problem for the following probability density functions: 


2 


(a) f(z) = Je ae) (0<a<2), f(x) =0 (otherwise). (b) f(x) = ae (—00 < @ < +00). 


3. Do some generalizations and adaptations to the properties of mean (see Eqs. 119-124).
Answer: For example:
• By repetitive application of Eq. 120 we get:

$$\mu\left(\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} \mu(x_i)$$

where $x_i$ ($i = 1, 2, \ldots, n$) are random variables defined over the same sample space. Moreover, if we
consider Eq. 121 as well we get (where $a_i$ are constants):

$$\mu\left(\sum_{i=1}^{n} a_i x_i\right) = \sum_{i=1}^{n} \mu(a_i x_i) = \sum_{i=1}^{n} a_i\, \mu(x_i) \tag{131}$$

• We can extend Eq. 123 to get:

$$\mu\left(\prod_{i=1}^{n} x_i\right) = \prod_{i=1}^{n} \mu(x_i)$$

where $x_i$ ($i = 1, 2, \ldots, n$) are mutually independent random variables defined over the same sample
space.
PE: Give more examples of generalizations and adaptations to the properties of mean. 
4. Verify Eqs. 119-124 for discrete random variables. 


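Answer: In the following we mainly use Eqs. 115, 117 and 127.
• Regarding Eq. 119 we have:

$$\mu(C) = \sum_i C\, p_i = C \sum_i p_i = C$$

• Regarding Eq. 120 we have:

$$\begin{aligned}
\mu(x + y) &= \sum_i \sum_j (x_i + y_j)\, P(x_i, y_j) \\
&= \sum_i \sum_j x_i\, P(x_i, y_j) + \sum_i \sum_j y_j\, P(x_i, y_j) \\
&= \sum_i x_i \left[\sum_j P(x_i, y_j)\right] + \sum_j y_j \left[\sum_i P(x_i, y_j)\right] \\
&= \sum_i x_i\, P_x(x_i) + \sum_j y_j\, P_y(y_j) \\
&= \mu(x) + \mu(y)
\end{aligned}$$

• Regarding Eq. 121 we have:

$$\mu(ax) = \sum_i a x_i\, p_i = a \sum_i x_i\, p_i = a\,\mu(x)$$

• Regarding Eq. 122 we have:

$$\mu(ax + C) = \sum_i (a x_i + C)\, p_i = a \sum_i x_i\, p_i + C \sum_i p_i = a\,\mu(x) + C$$

• Regarding Eq. 123 we have:

$$\mu(xy) = \sum_i \sum_j x_i y_j\, P(x_i, y_j) = \sum_i \sum_j x_i y_j\, P_x(x_i)\, P_y(y_j) = \left[\sum_i x_i\, P_x(x_i)\right] \left[\sum_j y_j\, P_y(y_j)\right] = \mu(x)\,\mu(y)$$

• Regarding Eq. 124 we have:

$$\begin{aligned}
\mu[g(x)] &= \sum_i g(x_i)\, p_i = \sum_i \left[g_1(x_i) + g_2(x_i)\right] p_i \\
&= \sum_i g_1(x_i)\, p_i + \sum_i g_2(x_i)\, p_i = \mu[g_1(x)] + \mu[g_2(x)]
\end{aligned}$$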

PE: Justify each step of the above verifications. 

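5. Verify Eqs. 119-124 for continuous random variables.
Answer: The verifications are straightforward by using the properties of integrals as well as the
equations given in the preamble of this section (noting that the density function is normalized to unity):

$$\mu(C) = \int_{-\infty}^{+\infty} C\, f(x)\,dx = C \int_{-\infty}^{+\infty} f(x)\,dx = C$$

$$\begin{aligned}
\mu(x + y) &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x + y)\, f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x\, f(x, y)\,dx\,dy + \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} y\, f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{+\infty} x \left[\int_{-\infty}^{+\infty} f(x, y)\,dy\right] dx + \int_{-\infty}^{+\infty} y \left[\int_{-\infty}^{+\infty} f(x, y)\,dx\right] dy \\
&= \int_{-\infty}^{+\infty} x\, f_x(x)\,dx + \int_{-\infty}^{+\infty} y\, f_y(y)\,dy \\
&= \mu(x) + \mu(y)
\end{aligned}$$

$$\mu(ax) = \int_{-\infty}^{+\infty} a x\, f(x)\,dx = a \int_{-\infty}^{+\infty} x\, f(x)\,dx = a\,\mu(x)$$

$$\mu(ax + C) = \int_{-\infty}^{+\infty} (ax + C)\, f(x)\,dx = a \int_{-\infty}^{+\infty} x\, f(x)\,dx + C \int_{-\infty}^{+\infty} f(x)\,dx = a\,\mu(x) + C$$

$$\begin{aligned}
\mu(xy) &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x y\, f(x, y)\,dx\,dy = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x y\, f_x(x)\, f_y(y)\,dx\,dy \\
&= \left[\int_{-\infty}^{+\infty} x\, f_x(x)\,dx\right] \left[\int_{-\infty}^{+\infty} y\, f_y(y)\,dy\right] = \mu(x)\,\mu(y)
\end{aligned}$$

$$\begin{aligned}
\mu[g(x)] &= \int_{-\infty}^{+\infty} g(x)\, f(x)\,dx = \int_{-\infty}^{+\infty} \left[g_1(x) + g_2(x)\right] f(x)\,dx \\
&= \int_{-\infty}^{+\infty} g_1(x)\, f(x)\,dx + \int_{-\infty}^{+\infty} g_2(x)\, f(x)\,dx = \mu[g_1(x)] + \mu[g_2(x)]
\end{aligned}$$

PE: Justify each step of the above verifications.

6. Show that:

$$\mathrm{Cov}(x, y) = \mu(xy) - \mu(x)\,\mu(y) \tag{132}$$

Answer: We have:

$$\begin{aligned}
\mathrm{Cov}(x, y) &= \mu[(x - \mu_x)(y - \mu_y)] && \text{(Eq. 129)} \\
&= \mu(xy - \mu_x y - \mu_y x + \mu_x \mu_y) \\
&= \mu(xy) - \mu(\mu_x y) - \mu(\mu_y x) + \mu(\mu_x \mu_y) && \text{(Eq. 131)} \\
&= \mu(xy) - \mu_x\, \mu(y) - \mu_y\, \mu(x) + \mu_x \mu_y && \text{(Eqs. 119 and 121)} \\
&= \mu(xy) - \mu_x \mu_y - \mu_y \mu_x + \mu_x \mu_y && \text{(definition)} \\
&= \mu(xy) - \mu(x)\,\mu(y) && \text{(definition)}
\end{aligned}$$

We note that $\mu_x$, $\mu_y$ and $\mu_x \mu_y$ are constants.
PE: Explain in words each step of the above derivation.

7. Show that the covariance of two independent random variables $x$ and $y$ is zero.
Answer: From the result of Problem 6 we have:

$$\mathrm{Cov}(x, y) = \mu(xy) - \mu(x)\,\mu(y)$$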


Now, if $x$ and $y$ are independent then from Eq. 123 we have $\mu(xy) = \mu(x)\,\mu(y)$ and hence $\mathrm{Cov}(x, y) = 0$.
Note: if we do not assume Eq. 123 then we may prove this result as follows (considering the continuous
case):[130]

$$\begin{aligned}
\mathrm{Cov}(x, y) &= \mu[(x - \mu_x)(y - \mu_y)] && \text{(Eq. 129)} \\
&= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x - \mu_x)(y - \mu_y)\, f(x, y)\,dx\,dy && \text{(Eq. 128)} \\
&= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x - \mu_x)(y - \mu_y)\, f_x(x)\, f_y(y)\,dx\,dy && \text{(Eq. 114)} \\
&= \left[\int_{-\infty}^{+\infty} (x - \mu_x)\, f_x(x)\,dx\right] \left[\int_{-\infty}^{+\infty} (y - \mu_y)\, f_y(y)\,dy\right] \\
&= \left[\mu_x - \mu_x \int_{-\infty}^{+\infty} f_x(x)\,dx\right] \left[\mu_y - \mu_y \int_{-\infty}^{+\infty} f_y(y)\,dy\right] && \text{(Eq. 116)} \\
&= (\mu_x - \mu_x)(\mu_y - \mu_y) = 0 && \text{(Eq. 87)}
\end{aligned}$$

PE: It is claimed that the converse of this statement is not true in general, i.e. if the covariance is zero
then the variables are not necessarily independent. Investigate this issue and determine if this claim
is correct or not.[131]

[130] In fact, the purpose of this is to show the derivation in more detail; otherwise both methods rest on the same principle
(i.e. the joint probability function of two independent random variables, $x$ and $y$, can be expressed as a product of a
probability function of $x$ alone times a probability function of $y$ alone; see Eqs. 108 and 114).


8. Define arithmetic, geometric and harmonic means and state the relation between them (assuming all 
the data are positive). 
Answer: The arithmetic mean $\mu$ is defined above (noting that it is the subject of this section). The
geometric mean $\mu_G$ is given by:

$$\mu_G = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

The harmonic mean $\mu_H$ is given by:[132]

$$\mu_H = n \left(\sum_{i=1}^{n} \frac{1}{x_i}\right)^{-1}$$

The relation between them is given by the inequality:

$$\mu_H \le \mu_G \le \mu$$

PE: Verify the relation $\mu_H \le \mu_G \le \mu$ for the data set of part (a) of Problem 1.
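The three means defined above are straightforward to compute for a list of positive data. The sketch below is an illustration only (function names are hypothetical); the geometric mean is evaluated as the exponential of the mean logarithm, which is a numerical choice made here to avoid overflow of the raw product and is mathematically equivalent to the formula given in the answer.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean: sum of the data divided by their number.
double arithmeticMean(const std::vector<double>& x) {
    assert(!x.empty());
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum / x.size();
}

// Geometric mean of positive data, computed as exp(mean of logs).
double geometricMean(const std::vector<double>& x) {
    assert(!x.empty());
    double logSum = 0.0;
    for (double v : x) {
        assert(v > 0.0);
        logSum += std::log(v);
    }
    return std::exp(logSum / x.size());
}

// Harmonic mean of positive data: n divided by the sum of reciprocals.
double harmonicMean(const std::vector<double>& x) {
    assert(!x.empty());
    double invSum = 0.0;
    for (double v : x) {
        assert(v > 0.0);
        invSum += 1.0 / v;
    }
    return x.size() / invSum;
}
```

Evaluating the three functions on a data set and comparing the results is one way to check the inequality $\mu_H \le \mu_G \le \mu$; the three means coincide when all the data are equal.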

9. Find the mean of the following discrete distributions: 
(a) Uniform. (b) Binomial. (c) Multinomial. (d) Poisson. (e) Geometric. 
Answer: 
(a) Uniform: we assume the distribution to be of the form $P(x_i) = P(i) = p_i = 1/n$ (where
$i = 1, 2, \ldots, n$). Accordingly:

$$\begin{aligned}
\mu(x) &= \sum_i \frac{n_i}{n}\, x_i && \text{(Eq. 115)} \\
&= \sum_{i=1}^{n} \frac{1}{n}\, i && \left(\frac{n_i}{n} = \frac{1}{n} \text{ and } x_i = i\right) \\
&= \frac{1}{n} \sum_{i=1}^{n} i \\
&= \frac{1}{n}\, \frac{n(n+1)}{2} && \text{(arithmetic series formula)} \\
&= \frac{n+1}{2}
\end{aligned}$$

This result can also be obtained (without calculation) from the fact that the uniform discrete
distribution (of the above form) is symmetric with respect to $i = (1 + n)/2$ (see Problem 12).
(b) Binomial:

$$\begin{aligned}
\mu(x) &= \sum_i p_i\, x_i && \text{(Eq. 115)} \\
&= \sum_{i=0}^{n} i\, p_i && \text{(for binomial } x_i = 0, 1, \ldots, n\text{)} \\
&= \sum_{i=1}^{n} i\, p_i && (i = 0 \text{ is trivial}) \\
&= \sum_{i=1}^{n} i\, C_i^n\, p^i (1-p)^{n-i} && \text{(Eq. 71)} \\
&= \sum_{i=1}^{n} i\, \frac{n!}{i!\,(n-i)!}\, p^i (1-p)^{n-i} && \text{(Eq. 25)} \\
&= \sum_{i=1}^{n} \frac{n!}{(i-1)!\,(n-i)!}\, p^i (1-p)^{n-i} && \text{(canceling } i\text{)} \\
&= np \sum_{i=1}^{n} \frac{(n-1)!}{(i-1)!\,(n-i)!}\, p^{i-1} (1-p)^{n-i} && \text{(taking } np \text{ out)} \\
&= np \sum_{i=1}^{n} \frac{(n-1)!}{(i-1)!\,[(n-1)-(i-1)]!}\, p^{i-1} (1-p)^{(n-1)-(i-1)} && \text{(adding and subtracting 1)} \\
&= np \sum_{i=1}^{n} C_{i-1}^{n-1}\, p^{i-1} (1-p)^{(n-1)-(i-1)} && \text{(Eq. 25)} \\
&= np \sum_{i=0}^{n-1} C_i^{n-1}\, p^i (1-p)^{(n-1)-i} && \text{(shifting the index)} \\
&= np\, (1 - p + p)^{n-1} && \text{(Eq. 33)} \\
&= np
\end{aligned}$$

[131] The reader should distinguish between the converse and contrapositive (i.e. of a conditional statement). For a
conditional statement $a \to b$, the converse is $b \to a$ while the contrapositive is $\bar{b} \to \bar{a}$ (where the bar means negation). In
general, the truth of the contrapositive follows the truth of the statement (i.e. if $a \to b$ is true then $\bar{b} \to \bar{a}$ is also true) but
this does not apply to the converse (i.e. if $a \to b$ is true then $b \to a$ is not necessarily true). By the way, the inverse
of the conditional statement $a \to b$ is $\bar{a} \to \bar{b}$ which (like the converse) does not follow the truth of the statement (i.e. if
$a \to b$ is true then $\bar{a} \to \bar{b}$ is not necessarily true).

[132] We note that the subscripts in $\mu_G$ and $\mu_H$ are specific to the geometric and harmonic means to distinguish them from
other types of mean, and hence they are not subject to our convention about the notation of mean (i.e. arithmetic
mean) which we outlined earlier in this section.


(c) Multinomial: referring to Eq. 73, the $k$ possible outcomes (or variables) in the multinomial
distribution are mutually exclusive where each one of these outcomes is subject to the binomial
distribution (i.e. relative to the rest of these outcomes). In other words, if outcome $i$ (where $i =
1, 2, \ldots, k$) has a probability $p_i$ then the probability of the other outcomes (which correspond to the
non-occurrence of outcome $i$) is $1 - p_i$ and hence $p_i$ corresponds to $p$ (i.e. the probability of occurrence)
in the binomial distribution and $1 - p_i$ corresponds to $1 - p$ (i.e. the probability of non-occurrence) in
the binomial distribution.[133] Accordingly, the mean of the $x_i$ outcome in the multinomial distribution
must be the same as the corresponding binomial distribution, that is:

$$\mu(x_i) = n p_i \tag{133}$$

A simple verification of this (in a special case) is the formula of the mean of the binomial distribution
considering that the binomial distribution is a multinomial distribution corresponding to $k = 2$, i.e.

$$P(n_1, n_2; p_1, p_2) = C_{n_1, n_2}^{n}\, p_1^{n_1}\, p_2^{n_2}$$

It is obvious that the mean of this “multinomial distribution” [which is $\mu(x_1) = n p_1$] is the same as
the mean of the corresponding binomial distribution [which is $\mu = np$] noting that $p_1 = p$, $p_2 = 1 - p_1$,
$C_{n_1, n_2}^{n} = C_{n_1}^{n}$ and $n_1 + n_2 = n$. This should also apply to $\mu(x_2) = n p_2$ which is the same as the mean
of the non-occurrence in the corresponding binomial distribution [which is $\mu_{\text{non-occurrence}} = n(1 - p)$].
So, we conclude that the mean of the variable $x_i$ ($i = 1, 2, \ldots, k$) in the multinomial distribution is
given by Eq. 133.[134]

[133] If we use the terminology of multivariate distributions (by treating the multinomial as multivariate; see the notes of §
4.1.3 as well as § 4.4) then we can say: the marginal distribution of each random variable $x_i$ (where $i = 1, 2, \ldots, k$) is a
binomial distribution parameterized by $n$ and $p_i$.
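(d) Poisson:

\[
\begin{aligned}
\mu(x) &= \sum_{i} x_i p_i && \text{(Eq. 115)} \\
&= \sum_{i=0}^{\infty} i \, p_i && \text{(for Poisson $x_i = 0, 1, 2, \ldots$)} \\
&= \sum_{i=1}^{\infty} i \, p_i && \text{(0 is trivial)} \\
&= \sum_{i=1}^{\infty} i \, \frac{\lambda^{i} e^{-\lambda}}{i!} && \text{(Eq. 75)} \\
&= \sum_{i=1}^{\infty} \frac{\lambda^{i} e^{-\lambda}}{(i-1)!} && \text{(canceling $i$)} \\
&= \lambda e^{-\lambda} \sum_{i=1}^{\infty} \frac{\lambda^{i-1}}{(i-1)!} && \text{(taking $\lambda e^{-\lambda}$ out)} \\
&= \lambda e^{-\lambda} \sum_{i=0}^{\infty} \frac{\lambda^{i}}{i!} && \text{(shifting the index)} \\
&= \lambda e^{-\lambda} e^{\lambda} && \text{(the exponential series)} \\
&= \lambda
\end{aligned}
\]

(e) Geometric (for brevity and clarity we use $q = 1 - p$):

\[
\begin{aligned}
\mu(x) &= \sum_{i} x_i p_i && \text{(Eq. 115)} \\
&= \sum_{i=1}^{\infty} i \, p_i && \text{(for geometric $x_i = 1, 2, 3, \ldots$)} \\
&= \sum_{i=1}^{\infty} i \, q^{i-1} p && \text{(Eq. 80 with $q = 1 - p$)} \\
&= p + 2qp + 3q^{2}p + \cdots && \text{(expanding)}
\end{aligned}
\]

Multiplying both sides by $q$ and subtracting we get:

\[
\begin{aligned}
q \, \mu(x) &= qp + 2q^{2}p + 3q^{3}p + \cdots \\
\mu(x) - q \, \mu(x) &= \left[ p + 2qp + 3q^{2}p + \cdots \right] - \left[ qp + 2q^{2}p + 3q^{3}p + \cdots \right] \\
\mu(x) - q \, \mu(x) &= p + qp + q^{2}p + q^{3}p + \cdots \\
\mu(x) - q \, \mu(x) &= \frac{p}{1-q} && \text{(geometric series)} \\
\mu(x) &= \frac{p}{(1-q)^{2}} \\
\mu(x) &= \frac{p}{p^{2}} \\
\mu(x) &= \frac{1}{p}
\end{aligned}
\]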


PE: Investigate other methods for deriving the mean of the above discrete distributions. 
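The closed forms obtained above can be checked numerically by summing Eq. 115 directly. The following is a minimal C++ sketch of such a check (it is not one of the codes that accompany the book); the function names `discrete_mean`, `binomial_pmf`, `binomial_mean_numeric` and `geometric_mean_numeric` are hypothetical names chosen for illustration, and the infinite geometric series is truncated after a user-chosen number of terms.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean of a discrete distribution by direct summation (Eq. 115):
// mu = sum_i x_i * p_i (x and p are assumed to have the same length).
double discrete_mean(const std::vector<double>& x, const std::vector<double>& p) {
    double mu = 0.0;
    for (std::size_t i = 0; i < x.size() && i < p.size(); ++i) mu += x[i] * p[i];
    return mu;
}

// Binomial probabilities P(i) = C(n,i) p^i (1-p)^(n-i) for i = 0..n, built with
// the recurrence P(i+1) = P(i) * (n-i)/(i+1) * p/(1-p). Requires 0 < p < 1.
std::vector<double> binomial_pmf(int n, double p) {
    std::vector<double> pmf(n + 1);
    pmf[0] = std::pow(1.0 - p, n);
    for (int i = 0; i < n; ++i)
        pmf[i + 1] = pmf[i] * (n - i) / (i + 1.0) * p / (1.0 - p);
    return pmf;
}

// Mean of the binomial distribution obtained numerically from Eq. 115.
double binomial_mean_numeric(int n, double p) {
    const std::vector<double> pmf = binomial_pmf(n, p);
    std::vector<double> x(n + 1);
    for (int i = 0; i <= n; ++i) x[i] = i;
    return discrete_mean(x, pmf);
}

// Mean of the geometric distribution P(i) = q^(i-1) p (i = 1, 2, ...),
// with the infinite series truncated after `terms` terms.
double geometric_mean_numeric(double p, int terms) {
    double mu = 0.0;
    double prob = p;  // holds q^(i-1) * p for the current i
    for (int i = 1; i <= terms; ++i) {
        mu += i * prob;
        prob *= (1.0 - p);
    }
    return mu;
}
```

For instance, `binomial_mean_numeric(10, 0.3)` should agree with $np = 3$ and `geometric_mean_numeric(0.25, 2000)` should agree with $1/p = 4$, both up to rounding and truncation error.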


10. Find the mean of the following continuous distributions: 


(a) Uniform. (b) Normal. 


(c) Exponential. 


[134] In fact, if we treat the multinomial distribution as a multivariate distribution then we can use Eq. 125 (as well as its extension and generalization) to prove Eq. 133.

Answer:
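(a) Uniform:

\[
\begin{aligned}
\mu(x) &= \int_{-\infty}^{+\infty} x f(x) \, dx && \text{(Eq. 116)} \\
&= \int_{a}^{b} x f(x) \, dx && \text{($f = 0$ for $x < a$ and $x > b$)} \\
&= \int_{a}^{b} \frac{x}{b-a} \, dx && \text{(Eq. 89)} \\
&= \frac{b+a}{2}
\end{aligned}
\]

This result can also be obtained (without calculation) from the fact that the uniform continuous distribution is symmetric with respect to $x = (b+a)/2$ (see Problem 12).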

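(b) Normal: by definition, the mean of the normal distribution is $\mu$ (where $\mu$ here means the parameter in Eq. 91). So, all we need to do is to apply the definition of mean (i.e. Eq. 116) to see if this is consistent or not, that is:

\[
\begin{aligned}
\mu(x) &= \int_{-\infty}^{+\infty} x f(x) \, dx && \text{(Eq. 116)} \\
&= \int_{-\infty}^{+\infty} x \, \frac{1}{\sqrt{2\pi V}} \, e^{-\frac{(x-\mu)^{2}}{2V}} \, dx && \text{(Eq. 91)} \\
&= \frac{1}{\sqrt{2\pi V}} \int_{-\infty}^{+\infty} x \, e^{-\frac{(x-\mu)^{2}}{2V}} \, dx \\
&= \frac{1}{\sqrt{2\pi V}} \left( \mu \sqrt{2\pi V} \right) && \text{(by calculus)} \\
&= \mu
\end{aligned}
\]

This result can also be obtained (without calculation) from the fact that the normal distribution is symmetric with respect to $x = \mu$ (see Problem 12).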
(c) Exponential:

\[
\begin{aligned}
\mu(x) &= \int_{-\infty}^{+\infty} x f(x) \, dx && \text{(Eq. 116)} \\
&= \int_{0}^{\infty} x \, a e^{-ax} \, dx && \text{(Eq. 93)} \\
&= a \int_{0}^{\infty} x e^{-ax} \, dx \\
&= a \left[ - \frac{(1 + ax) \, e^{-ax}}{a^{2}} \right]_{0}^{\infty} \\
&= \frac{a}{a^{2}} \\
&= \frac{1}{a}
\end{aligned}
\]

PE: Investigate other methods for deriving the mean of the above continuous distributions.

11. State the law of iterated expectations and prove it.

Answer: The law of iterated expectations states that if x and y are random variables defined over 
the same sample space then the mean of x is equal to the average of the means of x conditioned on y 
(i.e. for all possible values of y), that is: 


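\[
\mu(x) = \mu\left[ \mu(x|y) \right] \tag{134}
\]

To prove it (using a discrete approach) we have:[135]

\[
\begin{aligned}
\mu\left[ \mu(x|y) \right] &= \mu\left[ \sum_{x} x \, P(x|y) \right] && \text{(Eq. 115)} \\
&= \sum_{y} \left[ \sum_{x} x \, P(x|y) \right] P(y) && \text{(Eq. 117)} \\
&= \sum_{y} \sum_{x} x \, P(x|y) \, P(y) \\
&= \sum_{y} \sum_{x} x \, P(y|x) \, P(x) && \text{(Eq. 47)} \\
&= \sum_{x} \sum_{y} x \, P(x) \, P(y|x) \\
&= \sum_{x} x \, P(x) \sum_{y} P(y|x) && \text{[no $y$ in $x \, P(x)$]} \\
&= \sum_{x} x \, P(x) && \text{[$\textstyle\sum_{y} P(y|x) = 1$ for all $x$]} \\
&= \mu(x) && \text{(Eq. 115)}
\end{aligned}
\]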


PE: Justify in words the above derivation. 

12. Show that the mean of a probability distribution that is symmetric with respect to $x = c$ is $\mu(x) = c$ (assuming the distribution has a mean).

Answer: We show this for the case of continuous distribution where $f(x)$ is supposedly a symmetric distribution with respect to $x = c$. From Eq. 122 we have (with $a = 1$ and $C = -c$):

\[
\mu(x - c) = \mu(x) - c \tag{135}
\]

Now, if we use the linear transformation $y = x - c$ then the probability distribution function of $y$ is given by (see Eq. 63 in Problem 2 of § 4):

\[
g(y) = f[x(y)] \left| \frac{dx}{dy} \right| = f(y + c) \times 1 = f(y + c)
\]

and thus (see Eq. 116):

\[
\mu(x - c) = \mu(y) = \int_{-\infty}^{+\infty} y \, g(y) \, dy = \int_{-\infty}^{+\infty} y \, f(y + c) \, dy \tag{136}
\]


[135] From a notational perspective (as well as from other perspectives), this proof is not sufficiently rigorous (noting that the purpose of it is to clarify this law and demonstrate the rationale behind it). Anyway, the law of iterated expectations is investigated and used marginally in this book and hence this proof should be enough for our purpose.

Now, since $f(x)$ is symmetric with respect to $x = c$ then $f(y)$ is symmetric with respect to $y = c$ and hence $f(y + c) = f(y - [-c])$ should be symmetric with respect to $y = 0$. Accordingly, the integral in Eq. 136 must be zero because the integrand is an odd function [since $y$ is odd and $f(y + c)$ is even]. Therefore, from Eq. 136 we should have $\mu(x - c) = 0$. Hence, from Eq. 135 we get $\mu(x) - c = 0$ and thus $\mu(x) = c$ as required.

PE: Modify the above argument to fit the case of discrete distribution. 


5.2 Variance and Standard Deviation 


The variance of a numerical data set is a measure of the spread of the data around its average value. More 
technically and specifically, it is the mean of the squared deviation of a random variable from its mean. 

For a discrete real-valued random variable x (which can take distinct discrete values x; with corre- 
sponding probabilities p;), the variance V is given by the sum: 


\[
V(x) = \frac{1}{n} \sum_{i} n_i (x_i - \mu_x)^{2} = \sum_{i} \frac{n_i}{n} (x_i - \mu_x)^{2} = \sum_{i} (x_i - \mu_x)^{2} \, p_i = \mu\left( [x - \mu_x]^{2} \right) \tag{137}
\]
where the symbols are as defined earlier and the last step is based on the definition of mean (see § 5.1 
and Eq. 117 in particular). 

For a continuous real-valued random variable $x$ [which can take continuous values in the range $-\infty < x < \infty$ with a probability density function $f(x)$], the variance $V$ is given by the integral:

\[
V(x) = \int_{-\infty}^{+\infty} (x - \mu_x)^{2} f(x) \, dx = \mu\left( [x - \mu_x]^{2} \right) \tag{138}
\]


where the symbols are as defined earlier and the last step is based on the definition of mean (see § 5.1 
and Eq. 118 in particular). 

As indicated earlier (and expressed explicitly in the last steps of Eqs. 137 and 138), the variance is no 
more than the mean of the squared deviations from the average (and this should provide a simple way 
for memorizing and recalling the formula of variance as the “mean square”, i.e. the mean of the squared 
deviations). It should be obvious (from analyzing the above discussion and formulae) that large/small 
variance means large/small spread of the data around the mean, and this should provide more clarification 
about our claim earlier that the variance is a measure of the spread of the data around its average value. 

The variance satisfies the following properties (with x,y being random variables defined over the same 
sample space): 


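\[
\begin{aligned}
V(C) &= 0 && \text{($C$ is constant $\in \mathbb{R}$)} && (139) \\
V(x + y) &= V(x) + V(y) && \text{($x$ and $y$ are independent)} && (140) \\
V(ax + b) &= a^{2} V(x) && \text{($a$ and $b$ are constants $\in \mathbb{R}$)} && (141) \\
V(x + C) &= V(x) && \text{($C$ is constant $\in \mathbb{R}$)} && (142) \\
V(x) &= \mu\left[ V(x|y) \right] + V\left[ \mu(x|y) \right] && && (143)
\end{aligned}
\]

The last property is known as the law of total variance (among many other names). These properties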
(or some of them at least) can be easily generalized and extended to more variables. For example, Eq. 
140 can be extended to more than two (mutually independent) variables by repetitive application of Eq. 
140 (also see Problem 4). 

For a bivariate probability function (see § 4.4) of random variables x and y the variance of x is given 
by: 


\[
V(x) = \sum_{i} \sum_{j} (x_i - \mu_x)^{2} \, P(x_i, y_j) \qquad \text{(discrete variables)} \tag{144}
\]

\[
V(x) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (x - \mu_x)^{2} f(x, y) \, dx \, dy \qquad \text{(continuous variables)} \tag{145}
\]


Similar definitions apply to the variance of the variable y, i.e. V(y). 
The standard deviation of a random variable is defined as the (positive) square root of its variance, i.e. 


\[
\sigma(x) = \sqrt{V(x)} \tag{146}
\]


Like the variance, the standard deviation is a measure of the spread around the mean. Noting that the 
standard deviation is the square root of the variance (which is a “mean square”), we can exploit this to 
memorize and recall the formula of the standard deviation as the “root mean square”, i.e. the root of the 
mean of the squared deviations. 

There are a few important points to note about the variance (and standard deviation): 
• It should be obvious that for a random variable to have a variance (and standard deviation), the series of Eq. 137 (in the case of discrete) and the integral of Eq. 138 (in the case of continuous) should converge.
• It should be obvious that for a random variable to have a variance (and standard deviation) it should have a mean. However, having a mean does not guarantee having a variance (and standard deviation). In other words, having a mean is a necessary, but not sufficient, condition for having a variance.
• Like the mean, the variance (and standard deviation) can be sensibly attributed to the probability distribution of the random variable as well as to the random variable itself.
• Like the mean, the variance (and standard deviation) can be seen as a functional reflecting the characteristics of the distribution function of its random variable.
• The variance (and thus the standard deviation) of a given probability distribution may not exist and hence the distribution then is classified as pathological (see the Cauchy distribution in § 4.2.4). However, if the variance does exist then it is unique (i.e. with respect to a given variable).
• In the equations that involve more than one random variable (e.g. Eq. 140) we generally assume that the random variables are defined over the same sample space.
• Unlike the mean (which can be positive or negative or zero), the variance and the standard deviation cannot be negative.
• The standard deviation has the same physical dimensions as its random variable, while the variance has the square of the dimensions of its random variable.
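The definitions of this section translate directly into a short computation. The following C++ sketch (an illustration only, not one of the codes of the book) evaluates the mean by Eq. 115, the variance by the last form of the sum in Eq. 137 and the standard deviation by Eq. 146; the struct `Spread` and the function `spread_of` are hypothetical names, and the vectors of values and probabilities are assumed to have the same length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Holds the mean, variance and standard deviation of a discrete distribution.
struct Spread {
    double mean;
    double variance;
    double std_dev;
};

// x[i] are the values and p[i] their probabilities (assumed to sum to 1).
Spread spread_of(const std::vector<double>& x, const std::vector<double>& p) {
    double mu = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) mu += x[i] * p[i];  // Eq. 115
    double var = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        var += (x[i] - mu) * (x[i] - mu) * p[i];                   // Eq. 137
    return {mu, var, std::sqrt(var)};                              // Eq. 146
}
```

As a usage example, for a fair die (values 1 to 6, each with probability 1/6) the function should give a mean of 3.5 and a variance of $35/12$.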


Problems 


1. Find the variance and standard deviation of the data sets of parts (a) and (b) of Problem 1 of § 5.1. 
Answer: We use Eqs. 137 and 146. 
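(a) Noting that $\mu_x \approx 12.94$ according to part (a) of Problem 1 of § 5.1 we have (where $x_1$ is the first value of the data set and $n = 12$):

\[
V(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^{2} \approx \frac{(x_1 - 12.94)^{2} + (9.4 - 12.94)^{2} + \cdots + (21.7 - 12.94)^{2}}{12} \approx 24.53
\]

\[
\sigma(x) = \sqrt{V(x)} \approx 4.95
\]

(b) Noting that $\mu_{\text{sum}} = 7$ according to part (b) of Problem 1 of § 5.1 we have (noting that $x$ represents the sum):

\[
V(\text{sum}) = \sum_{i} (x_i - \mu_x)^{2} \, p_i = \sum_{i} (x_i - 7)^{2} \, p_i = \frac{35}{6}
\]

\[
\sigma(\text{sum}) = \sqrt{V(\text{sum})} = \sqrt{35/6} \approx 2.415
\]
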


PE: Find the variance and standard deviation of the data set of the PE of Problem 1 of § 5.1. 
2. Find the variance and standard deviation of the density functions of parts (a) and (b) of Problem 2 of § 5.1.
Answer: We use Eqs. 138 and 146.
(a) Noting that $\mu_x = 8/3$ according to part (a) of Problem 2 of § 5.1 we have:

\[
V(x) = \int_{-\infty}^{+\infty} (x - \mu_x)^{2} f(x) \, dx = \int_{0}^{4} \left( x - \frac{8}{3} \right)^{2} \frac{x}{8} \, dx = \frac{8}{9}
\]

\[
\sigma(x) = \sqrt{V(x)} = \sqrt{8/9} \approx 0.943
\]

(b) Noting that $\mu_x = \pi$ according to part (b) of Problem 2 of § 5.1 we have (with $f$ being the density function of that part):

\[
V(x) = \int_{-\infty}^{+\infty} (x - \mu_x)^{2} f(x) \, dx = \int_{-\infty}^{+\infty} (x - \pi)^{2} f(x) \, dx = \frac{\pi^{2}}{4}
\]

\[
\sigma(x) = \sqrt{V(x)} = \sqrt{\pi^{2}/4} = \pi/2 \approx 1.571
\]


PE: Find the variance and standard deviation of the density functions of the PE of Problem 2 of § 5.1. 
3. Show that the variance can also be given by: 


\[
V(x) = \mu(x^{2}) - [\mu(x)]^{2} = \mu_{x^{2}} - \mu_x^{2} \tag{147}
\]

Answer: We have:

\[
\begin{aligned}
V(x) &= \mu\left( [x - \mu_x]^{2} \right) && \text{(Eqs. 137 and 138)} \\
&= \mu\left( x^{2} - 2x\mu_x + \mu_x^{2} \right) \\
&= \mu(x^{2}) - \mu(2x\mu_x) + \mu(\mu_x^{2}) && \text{(extension of Eq. 124)} \\
&= \mu(x^{2}) - 2\mu_x \, \mu(x) + \mu_x^{2} && \text{(Eqs. 119 and 121)} \\
&= \mu(x^{2}) - 2\mu_x^{2} + \mu_x^{2} && \text{(definition of $\mu_x$)} \\
&= \mu(x^{2}) - \mu_x^{2} \\
&= \mu(x^{2}) - [\mu(x)]^{2} && \text{(definition of $\mu_x$)}
\end{aligned}
\]

Note: this relation is commonly written as $V(x) = \langle x^{2} \rangle - \langle x \rangle^{2}$ where the triangular brackets $\langle \, \rangle$ symbolize mean. This form may be easier to remember (i.e. as square-in minus square-out).

PE: Investigate the possible advantages and disadvantages of using Eq. 147 instead of Eqs. 137 and 138. Can we conclude from Eq. 147 that for a random variable to have a variance it should have a mean $\mu(x)$ and a mean of its square $\mu(x^{2})$, i.e. having a mean $\mu(x)$ is not sufficient for having a variance?

4. Give some examples of generalizations and adaptations that can be made to the properties of variance (see Eqs. 139-143).

Answer: For example, if $x_1, x_2, \ldots, x_n$ are mutually independent random variables defined over the same sample space then we can generalize Eq. 140 to:

\[
V\left( \sum_{i=1}^{n} x_i \right) = \sum_{i=1}^{n} V(x_i)
\]

Moreover, if we consider Eq. 141 as well then we get (where $a_i$ are constants):

\[
V\left( \sum_{i=1}^{n} a_i x_i \right) = \sum_{i=1}^{n} V(a_i x_i) = \sum_{i=1}^{n} a_i^{2} \, V(x_i)
\]


PE: Investigate other potential generalizations and adaptations to the properties of variance.
5. Verify Eqs. 139-143.
Answer:
• Regarding Eq. 139 we have (considering discrete and continuous cases together):

\[
V(C) = \mu(C^{2}) - [\mu(C)]^{2} = C^{2} - [C]^{2} = C^{2} - C^{2} = 0
\]

where we used Eq. 147 in step 1 and Eq. 119 in step 2.
We may also consider the discrete case specifically:

\[
V(C) = \frac{1}{n} \sum_{i} n_i \left( C - \mu[C] \right)^{2} = \frac{1}{n} \sum_{i} n_i (C - C)^{2} = \frac{1}{n} \sum_{i} 0 = 0
\]

where we used Eq. 137 in step 1 and Eq. 119 in step 2.
We may also consider the continuous case specifically:

\[
V(C) = \int_{-\infty}^{+\infty} \left( C - \mu[C] \right)^{2} f(x) \, dx = \int_{-\infty}^{+\infty} (C - C)^{2} f(x) \, dx = 0
\]

where we used Eq. 138 in step 1 and Eq. 119 in step 2.
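• Regarding Eq. 140 we have (considering discrete and continuous cases together):

\[
\begin{aligned}
V(x + y) &= \mu\left[ (x + y)^{2} \right] - [\mu(x + y)]^{2} && \text{(Eq. 147)} \\
&= \mu\left[ x^{2} + 2xy + y^{2} \right] - [\mu(x) + \mu(y)]^{2} && \text{(Eq. 120)} \\
&= \mu(x^{2}) + 2\mu(xy) + \mu(y^{2}) - [\mu(x)]^{2} - 2\mu(x)\mu(y) - [\mu(y)]^{2} && \text{(Eq. 131)} \\
&= \left\{ \mu(x^{2}) - [\mu(x)]^{2} \right\} + \left\{ \mu(y^{2}) - [\mu(y)]^{2} \right\} + \left\{ 2\mu(xy) - 2\mu(x)\mu(y) \right\} \\
&= V(x) + V(y) + 2\left\{ \mu(xy) - \mu(x)\mu(y) \right\} && \text{(Eq. 147)} \\
&= V(x) + V(y) + 2\,\mathrm{Cov}(x, y) && \text{(Eq. 132)}
\end{aligned}
\]

Now, since $x$ and $y$ are independent then we can use the result of Problem 7 of § 5.1 [i.e. $\mathrm{Cov}(x, y) = 0$] and hence we conclude that $V(x + y) = V(x) + V(y)$. It is worth noting that if we discard the condition that $x$ and $y$ are independent then from the last line we have:

\[
V(x + y) = V(x) + V(y) + 2\,\mathrm{Cov}(x, y)
\]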


So, Eq. 140 is a special case of this general equation. 
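• Regarding Eq. 141 we have (considering discrete and continuous cases together):

\[
\begin{aligned}
V(ax + b) &= \mu\left[ (ax + b)^{2} \right] - [\mu(ax + b)]^{2} && \text{(Eq. 147)} \\
&= \mu\left[ a^{2}x^{2} + 2abx + b^{2} \right] - [a\mu(x) + b]^{2} && \text{(Eq. 122)} \\
&= a^{2}\mu(x^{2}) + 2ab\,\mu(x) + \mu(b^{2}) - a^{2}[\mu(x)]^{2} - 2ab\,\mu(x) - b^{2} && \text{(Eqs. 121 and 124)} \\
&= a^{2}\mu(x^{2}) + 2ab\,\mu(x) + b^{2} - a^{2}[\mu(x)]^{2} - 2ab\,\mu(x) - b^{2} && \text{(Eq. 119)} \\
&= a^{2}\mu(x^{2}) - a^{2}[\mu(x)]^{2} \\
&= a^{2}\left\{ \mu(x^{2}) - [\mu(x)]^{2} \right\} \\
&= a^{2}\,V(x) && \text{(Eq. 147)}
\end{aligned}
\]

We may also consider the discrete case specifically (see Eq. 137 as well as Eqs. 119-124):[136]

\[
\begin{aligned}
V(ax + b) &= \frac{1}{n} \sum_{i} n_i \left[ (ax_i + b) - \mu(ax + b) \right]^{2} = \frac{1}{n} \sum_{i} n_i \left[ (ax_i + b) - a\mu(x) - \mu(b) \right]^{2} \\
&= \frac{1}{n} \sum_{i} n_i \left[ a(x_i - \mu_x) \right]^{2} = a^{2} \, \frac{1}{n} \sum_{i} n_i (x_i - \mu_x)^{2} = a^{2}\,V(x)
\end{aligned}
\]

[136] Because the variance is a mean then the variance of a function of $x$ follows the style of the variance of $x$.

We may also consider the continuous case specifically (see Eq. 138 as well as Eqs. 119-124):

\[
\begin{aligned}
V(ax + b) &= \int_{-\infty}^{+\infty} \left[ (ax + b) - \mu(ax + b) \right]^{2} f(x) \, dx = \int_{-\infty}^{+\infty} \left[ (ax + b) - (a\mu_x + b) \right]^{2} f(x) \, dx \\
&= \int_{-\infty}^{+\infty} \left[ a(x - \mu_x) \right]^{2} f(x) \, dx = a^{2} \int_{-\infty}^{+\infty} (x - \mu_x)^{2} f(x) \, dx = a^{2}\,V(x)
\end{aligned}
\]

• Regarding Eq. 142, it is an instance of Eq. 141 with $a = 1$ and $b = C$.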
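• Regarding Eq. 143 we have (considering discrete and continuous cases together):

\[
\begin{aligned}
V(x) &= \mu(x^{2}) - [\mu(x)]^{2} && \text{(Eq. 147)} \\
&= \mu\left[ \mu(x^{2}|y) \right] - [\mu(x)]^{2} && \text{(Eq. 134)} \\
&= \mu\left[ V(x|y) + \mu(x|y)^{2} \right] - [\mu(x)]^{2} && \text{(Eq. 147)} \\
&= \mu\left[ V(x|y) \right] + \mu\left[ \mu(x|y)^{2} \right] - [\mu(x)]^{2} && \text{(Eq. 124)} \\
&= \mu\left[ V(x|y) \right] + \mu\left[ \mu(x|y)^{2} \right] - \left[ \mu\left( \mu[x|y] \right) \right]^{2} && \text{(Eq. 134)} \\
&= \mu\left[ V(x|y) \right] + V\left[ \mu(x|y) \right] && \text{(Eq. 147)}
\end{aligned}
\]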


PE: Justify the steps that we did not justify in the above verifications. 

6. Find the variance of the following discrete distributions: 
(a) Uniform. (b) Binomial. (c) Multinomial. (d) Poisson. (e) Geometric. 
Answer: 
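(a) Uniform: we assume the distribution to be of the form $P(x_i) = P(i) = p_i = 1/n$ (where $i = 1, 2, \ldots, n$). Accordingly:

\[
\begin{aligned}
V(x) &= \mu(x^{2}) - [\mu(x)]^{2} && \text{(Eq. 147)} \\
&= \mu(x^{2}) - \left( \frac{1+n}{2} \right)^{2} && \text{(Eq. 70)} \\
&= \sum_{i=1}^{n} x_i^{2} \, p_i - \left( \frac{1+n}{2} \right)^{2} && \text{(Eq. 117)} \\
&= \sum_{i=1}^{n} \frac{i^{2}}{n} - \left( \frac{1+n}{2} \right)^{2} && \text{($p_i = 1/n$ and $x_i = i$)} \\
&= \frac{1}{n} \, \frac{n(n+1)(2n+1)}{6} - \left( \frac{1+n}{2} \right)^{2} && \text{(formula of $\textstyle\sum i^{2}$)} \\
&= \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^{2}}{4} \\
&= \frac{n+1}{2} \left[ \frac{2n+1}{3} - \frac{1+n}{2} \right] \\
&= \frac{n+1}{2} \, \frac{n-1}{6} \\
&= \frac{n^{2}-1}{12}
\end{aligned}
\]

(b) Binomial:

\[
\begin{aligned}
V(x) &= \mu(x^{2}) - [\mu(x)]^{2} && \text{(Eq. 147)} \\
&= \mu(x^{2}) - (np)^{2} && \text{(Eq. 72)} \\
&= \mu(x^{2} - x + x) - (np)^{2} && \text{($-$ and $+$ $x$)} \\
&= \mu(x) - (np)^{2} + \mu(x^{2} - x) && \text{(Eq. 124)} \\
&= np - (np)^{2} + \mu[x(x-1)] && \text{(Eq. 72)} \\
&= np - (np)^{2} + \sum_{i=0}^{n} i(i-1) \, p_i && \text{(Eq. 117)} \\
&= np - (np)^{2} + \sum_{i=0}^{n} i(i-1) \, C^{n}_{i} \, p^{i} (1-p)^{n-i} && \text{(Eq. 71)} \\
&= np - (np)^{2} + \sum_{i=2}^{n} i(i-1) \, C^{n}_{i} \, p^{i} (1-p)^{n-i} && \text{(the terms $i = 0, 1$ are trivial)} \\
&= np - (np)^{2} + \sum_{i=2}^{n} i(i-1) \, \frac{n!}{i! \, (n-i)!} \, p^{i} (1-p)^{n-i} && \text{(Eq. 25)} \\
&= np - (np)^{2} + \sum_{i=2}^{n} \frac{n!}{(i-2)! \, (n-i)!} \, p^{i} (1-p)^{n-i} && \text{[cancel $i(i-1)$]} \\
&= np - (np)^{2} + n(n-1) \sum_{i=2}^{n} \frac{(n-2)!}{(i-2)! \, (n-i)!} \, p^{i} (1-p)^{n-i} && \text{[take $n(n-1)$ out]} \\
&= np - (np)^{2} + n(n-1) p^{2} \sum_{i=2}^{n} \frac{(n-2)!}{(i-2)! \, (n-i)!} \, p^{i-2} (1-p)^{n-i} && \text{(take $p^{2}$ out)} \\
&= np - (np)^{2} + n(n-1) p^{2} \sum_{i=2}^{n} \frac{(n-2)!}{(i-2)! \, ([n-2]-[i-2])!} \, p^{i-2} (1-p)^{[n-2]-[i-2]} && \text{($-$ and $+$ 2)} \\
&= np - (np)^{2} + n(n-1) p^{2} \sum_{i=2}^{n} C^{n-2}_{i-2} \, p^{i-2} (1-p)^{[n-2]-[i-2]} && \text{(Eq. 25)} \\
&= np - (np)^{2} + n(n-1) p^{2} \sum_{i=0}^{n-2} C^{n-2}_{i} \, p^{i} (1-p)^{[n-2]-i} && \text{(shift the index)} \\
&= np - (np)^{2} + n(n-1) p^{2} \, (1-p+p)^{n-2} && \text{(Eq. 33)} \\
&= np - n^{2}p^{2} + n^{2}p^{2} - np^{2} \\
&= np(1-p)
\end{aligned}
\]

(c) Multinomial: we repeat our argument in part (c) of Problem 9 of § 5.1 and hence we conclude that the variance of the variable $x_i$ ($i = 1, 2, \ldots, k$) in the multinomial distribution is given by (see Eq. 72):

\[
V(x_i) = n p_i (1 - p_i)
\]

(d) Poisson:

\[
\begin{aligned}
V(x) &= \mu(x^{2}) - [\mu(x)]^{2} && \text{(Eq. 147)} \\
&= \mu(x^{2}) - \lambda^{2} && \text{(Eq. 76)} \\
&= -\lambda^{2} + \mu(x^{2}) \\
&= -\lambda^{2} + \sum_{i=0}^{\infty} i^{2} \, p_i && \text{(Eq. 117)} \\
&= -\lambda^{2} + \sum_{i=0}^{\infty} i^{2} \, \frac{\lambda^{i} e^{-\lambda}}{i!} && \text{(Eq. 75)} \\
&= -\lambda^{2} + \sum_{i=1}^{\infty} i^{2} \, \frac{\lambda^{i} e^{-\lambda}}{i!} && \text{(0 is trivial)} \\
&= -\lambda^{2} + \sum_{i=1}^{\infty} i \, \frac{\lambda^{i} e^{-\lambda}}{(i-1)!} && \text{(cancel $i$)} \\
&= -\lambda^{2} + \sum_{i=1}^{\infty} (i - 1 + 1) \, \frac{\lambda^{i} e^{-\lambda}}{(i-1)!} && \text{($-$ and $+$ 1)} \\
&= -\lambda^{2} + \sum_{i=1}^{\infty} (i - 1) \, \frac{\lambda^{i} e^{-\lambda}}{(i-1)!} + \sum_{i=1}^{\infty} \frac{\lambda^{i} e^{-\lambda}}{(i-1)!} && \text{(distribution)} \\
&= -\lambda^{2} + \sum_{i=2}^{\infty} (i - 1) \, \frac{\lambda^{i} e^{-\lambda}}{(i-1)!} + \sum_{i=1}^{\infty} \frac{\lambda^{i} e^{-\lambda}}{(i-1)!} && \text{(0 is trivial)} \\
&= -\lambda^{2} + \sum_{i=2}^{\infty} \frac{\lambda^{i} e^{-\lambda}}{(i-2)!} + \sum_{i=1}^{\infty} \frac{\lambda^{i} e^{-\lambda}}{(i-1)!} && \text{(cancel $i - 1$)} \\
&= -\lambda^{2} + \lambda^{2} e^{-\lambda} \sum_{i=2}^{\infty} \frac{\lambda^{i-2}}{(i-2)!} + \lambda e^{-\lambda} \sum_{i=1}^{\infty} \frac{\lambda^{i-1}}{(i-1)!} && \text{(factorize)} \\
&= -\lambda^{2} + \lambda^{2} e^{-\lambda} \sum_{i=0}^{\infty} \frac{\lambda^{i}}{i!} + \lambda e^{-\lambda} \sum_{i=0}^{\infty} \frac{\lambda^{i}}{i!} && \text{(shift the indices)} \\
&= -\lambda^{2} + \lambda^{2} e^{-\lambda} e^{\lambda} + \lambda e^{-\lambda} e^{\lambda} && \text{(the exponential series)} \\
&= -\lambda^{2} + \lambda^{2} + \lambda \\
&= \lambda
\end{aligned}
\]

(e) Geometric (for brevity and clarity we use $q = 1 - p$):

\[
\begin{aligned}
V(x) &= \mu(x^{2}) - [\mu(x)]^{2} && \text{(Eq. 147)} \\
&= \mu(x^{2}) - \mu(x) + \mu(x) - [\mu(x)]^{2} && \text{[$-$ and $+$ $\mu(x)$]} \\
&= \mu(x^{2}) - \mu(x) + \frac{1}{p} - \frac{1}{p^{2}} && \text{(Eq. 81)} \\
&= \mu(x^{2} - x) + \frac{1}{p} - \frac{1}{p^{2}} && \text{(Eq. 124)} \qquad (148)
\end{aligned}
\]

So, what we need now to find $V(x)$ is to find $\mu(x^{2} - x)$, that is:

\[
\begin{aligned}
\mu(x^{2} - x) &= \mu[x(x-1)] \\
&= \sum_{i} x_i (x_i - 1) \, p_i && \text{(Eq. 117)} \\
&= \sum_{i=1}^{\infty} i(i-1) \, q^{i-1} p && \text{(Eq. 80 with $q = 1 - p$)} \\
&= p \sum_{i=1}^{\infty} i(i-1) \, q^{i-1} && \text{(factorize)} \\
&= p \, \frac{d}{dq} \sum_{i=1}^{\infty} (i-1) \, q^{i} && \text{(calculus)} \\
&= p \, \frac{d}{dq} \sum_{i=2}^{\infty} (i-1) \, q^{i} && \text{(0 is trivial)} \\
&= p \, \frac{d}{dq} \left[ q^{2} \sum_{i=2}^{\infty} (i-1) \, q^{i-2} \right] && \text{(factorize)} \\
&= p \, \frac{d}{dq} \left[ q^{2} \, \frac{d}{dq} \sum_{i=2}^{\infty} q^{i-1} \right] && \text{(calculus)} \\
&= p \, \frac{d}{dq} \left[ q^{2} \, \frac{d}{dq} \sum_{i=1}^{\infty} q^{i} \right] && \text{(shift index)} \\
&= p \, \frac{d}{dq} \left[ q^{2} \, \frac{d}{dq} \left\{ \frac{q}{1-q} \right\} \right] && \text{(geometric series)} \\
&= p \, \frac{d}{dq} \left[ \frac{q^{2}}{(1-q)^{2}} \right] && \text{(calculus)} \\
&= p \left[ \frac{2q}{(1-q)^{3}} \right] && \text{(calculus)} \\
&= (1-q) \left[ \frac{2q}{(1-q)^{3}} \right] && \text{($p = 1 - q$)} \\
&= \frac{2q}{(1-q)^{2}} \\
&= \frac{2(1-p)}{p^{2}} && \text{($q = 1 - p$)}
\end{aligned}
\]

On substituting from the last equation into Eq. 148 we get:

\[
V(x) = \frac{2(1-p)}{p^{2}} + \frac{1}{p} - \frac{1}{p^{2}} = \frac{2 - 2p + p - 1}{p^{2}} = \frac{1-p}{p^{2}}
\]

PE: Investigate other methods for deriving the variance of the above discrete distributions.
7. Find the variance of the following continuous distributions: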


(a) Uniform. (b) Normal. (c) Exponential. 
Answer: 
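(a) Uniform:

\[
\begin{aligned}
V(x) &= \mu(x^{2}) - [\mu(x)]^{2} && \text{(Eq. 147)} \\
&= \mu(x^{2}) - \left( \frac{b+a}{2} \right)^{2} && \text{(Eq. 90)} \\
&= -\left( \frac{b+a}{2} \right)^{2} + \mu(x^{2}) \\
&= -\left( \frac{b+a}{2} \right)^{2} + \int_{-\infty}^{+\infty} x^{2} f(x) \, dx && \text{(Eq. 118)} \\
&= -\left( \frac{b+a}{2} \right)^{2} + \int_{a}^{b} \frac{x^{2}}{b-a} \, dx && \text{(Eq. 89)} \\
&= -\left( \frac{b+a}{2} \right)^{2} + \frac{1}{b-a} \left[ \frac{x^{3}}{3} \right]_{a}^{b} \\
&= -\left( \frac{b+a}{2} \right)^{2} + \frac{1}{b-a} \, \frac{b^{3} - a^{3}}{3} \\
&= -\frac{a^{2} + 2ab + b^{2}}{4} + \frac{b^{2} + ab + a^{2}}{3} \\
&= \frac{-3a^{2} - 6ab - 3b^{2} + 4b^{2} + 4ab + 4a^{2}}{12} \\
&= \frac{a^{2} - 2ab + b^{2}}{12} \\
&= \frac{(b-a)^{2}}{12}
\end{aligned}
\]

(b) Normal: by definition, the variance of the normal distribution is $V$ (where $V$ here means the parameter in Eq. 91). So, all we need to do is to apply the equation of variance (i.e. Eq. 147) to see if this is consistent or not, that is:

\[
\begin{aligned}
V(x) &= \mu(x^{2}) - [\mu(x)]^{2} && \text{(Eq. 147)} \\
&= \mu(x^{2}) - \mu^{2} && \text{(see part b of Problem 10 of § 5.1)} \\
&= -\mu^{2} + \mu(x^{2}) \\
&= -\mu^{2} + \int_{-\infty}^{+\infty} x^{2} f(x) \, dx && \text{(Eq. 118)} \\
&= -\mu^{2} + \int_{-\infty}^{+\infty} x^{2} \, \frac{1}{\sqrt{2\pi V}} \, e^{-\frac{(x-\mu)^{2}}{2V}} \, dx && \text{(Eq. 91)} \\
&= -\mu^{2} + \frac{1}{\sqrt{2\pi V}} \int_{-\infty}^{+\infty} x^{2} \, e^{-\frac{(x-\mu)^{2}}{2V}} \, dx \\
&= -\mu^{2} + \frac{1}{\sqrt{2\pi V}} \, \sqrt{2\pi V} \left( \mu^{2} + V \right) && \text{(from calculus)} \\
&= -\mu^{2} + \mu^{2} + V \\
&= V
\end{aligned}
\]

(c) Exponential:

\[
\begin{aligned}
V(x) &= \mu(x^{2}) - [\mu(x)]^{2} && \text{(Eq. 147)} \\
&= \mu(x^{2}) - \frac{1}{a^{2}} && \text{(Eq. 94)} \\
&= -\frac{1}{a^{2}} + \mu(x^{2}) \\
&= -\frac{1}{a^{2}} + \int_{-\infty}^{+\infty} x^{2} f(x) \, dx && \text{(Eq. 118)} \\
&= -\frac{1}{a^{2}} + \int_{0}^{\infty} x^{2} \, a e^{-ax} \, dx && \text{(Eq. 93)} \\
&= -\frac{1}{a^{2}} + \frac{2}{a^{2}} && \text{(from calculus)} \\
&= \frac{1}{a^{2}}
\end{aligned}
\]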


PE: Investigate other methods for deriving the variance of the above continuous distributions. 

8. An experimentally-collected data set is given by the list of values {0.12, 1.23, 0.89, 5.60, 3.22} with 
corresponding probabilities {0.15, 0.13, 0.21, 0.33, 0.18}. If the random variable x represents the values 
in the list while the random variable y represents the square root of these values, find the covariance 
and correlation of x and y. 

Answer: We have: 


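\[
\begin{aligned}
\mu(x) &= \sum_{i=1}^{5} p_i x_i = 2.7924 && \text{(Eq. 115)} \\
\mu(y) &= \sum_{i=1}^{5} p_i \sqrt{x_i} \approx 1.498173 && \text{(Eq. 117)} \\
\mu(xy) &= \sum_{i=1}^{5} p_i x_i^{3/2} \approx 5.773115 && \text{(Eq. 117)} \\
V(x) &= \sum_{i=1}^{5} (x_i - \mu_x)^{2} \, p_i \approx 4.782792 && \text{(Eq. 137)} \\
V(y) &= \sum_{i=1}^{5} (y_i - \mu_y)^{2} \, p_i \approx 0.547877 && \text{(Eq. 137)} \\
\mathrm{Cov}(x, y) &= \mu(xy) - \mu(x)\mu(y) \approx 1.589617 && \text{(Eq. 132)} \\
\mathrm{Cor}(x, y) &= \frac{\mathrm{Cov}(x, y)}{\sqrt{V(x) \, V(y)}} \approx 0.981997 && \text{(Eq. 130)}
\end{aligned}
\]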


Note: the covariance can also be calculated from its definition (i.e. by using Eq. 129), that is: 


\[
\mathrm{Cov}(x, y) = \mu\left[ (x - \mu_x)(y - \mu_y) \right] = \sum_{i=1}^{5} (x_i - \mu_x)(y_i - \mu_y) \, p_i \approx 1.589617
\]


PE: Repeat the Problem where y now represents the cubic root of these values. 
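The calculation of this Problem is easy to automate. The following C++ sketch (an illustration only; the names `mean_of`, `CovCor` and `cov_cor` are hypothetical) computes the covariance from its definition (Eq. 129) and the correlation from Eq. 130, assuming that the three input vectors have the same length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted mean of the values g[i] with probabilities p[i] (Eq. 117).
double mean_of(const std::vector<double>& g, const std::vector<double>& p) {
    double mu = 0.0;
    for (std::size_t i = 0; i < g.size(); ++i) mu += g[i] * p[i];
    return mu;
}

// Holds the covariance and correlation of two random variables.
struct CovCor {
    double cov;
    double cor;
};

// x[i], y[i] are paired values that occur together with probability p[i].
CovCor cov_cor(const std::vector<double>& x, const std::vector<double>& y,
               const std::vector<double>& p) {
    const double mx = mean_of(x, p);
    const double my = mean_of(y, p);
    double cov = 0.0, vx = 0.0, vy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        cov += (x[i] - mx) * (y[i] - my) * p[i];  // Eq. 129
        vx += (x[i] - mx) * (x[i] - mx) * p[i];   // Eq. 137
        vy += (y[i] - my) * (y[i] - my) * p[i];   // Eq. 137
    }
    return {cov, cov / std::sqrt(vx * vy)};       // Eq. 130
}
```

Feeding the data of this Problem (with $y_i = \sqrt{x_i}$) into `cov_cor` should reproduce the covariance and correlation obtained above. The PE can be approached in the same way by replacing the square root with the cubic root.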
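9. Verify the following (noting that $x, y, x_1, x_2, y_1, y_2$ are random variables and $a_1, a_2, b_1, b_2$ are constants):
(a) $\mathrm{Cov}(x, x) = V(x)$. (b) $\mathrm{Cov}(x, y) = \mathrm{Cov}(y, x)$.
(c) If $y_1 = a_1 x_1 + b_1$ and $y_2 = a_2 x_2 + b_2$ then $\mathrm{Cov}(y_1, y_2) = a_1 a_2 \, \mathrm{Cov}(x_1, x_2)$.
Answer:
(a) We have:

\[
\begin{aligned}
\mathrm{Cov}(x, x) &= \mu\left[ (x - \mu_x)(x - \mu_x) \right] && \text{(Eq. 129)} \\
&= \mu\left[ (x - \mu_x)^{2} \right] \\
&= V(x) && \text{(Eqs. 137 and 138)}
\end{aligned}
\]

(b) We have:

\[
\begin{aligned}
\mathrm{Cov}(x, y) &= \mu\left[ (x - \mu_x)(y - \mu_y) \right] && \text{(Eq. 129)} \\
&= \mu\left[ (y - \mu_y)(x - \mu_x) \right] \\
&= \mathrm{Cov}(y, x) && \text{(Eq. 129)}
\end{aligned}
\]

(c) We have:

\[
\begin{aligned}
\mathrm{Cov}(y_1, y_2) &= \mu\left[ (y_1 - \mu_{y_1})(y_2 - \mu_{y_2}) \right] && \text{(Eq. 129)} \\
&= \mu\left[ \left( \{a_1 x_1 + b_1\} - \{a_1 \mu(x_1) + b_1\} \right) \left( \{a_2 x_2 + b_2\} - \{a_2 \mu(x_2) + b_2\} \right) \right] && \text{(Eq. 122)} \\
&= \mu\left[ \left( a_1 x_1 - a_1 \mu(x_1) \right) \left( a_2 x_2 - a_2 \mu(x_2) \right) \right] \\
&= a_1 a_2 \, \mu\left[ \left( x_1 - \mu(x_1) \right) \left( x_2 - \mu(x_2) \right) \right] && \text{(Eq. 121)} \\
&= a_1 a_2 \, \mu\left[ (x_1 - \mu_{x_1})(x_2 - \mu_{x_2}) \right] \\
&= a_1 a_2 \, \mathrm{Cov}(x_1, x_2) && \text{(Eq. 129)}
\end{aligned}
\]
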


Chapter 6
Useful Theorems


In this chapter we investigate some useful theorems (or laws or results or ...) in the probability theory 
which are commonly met in the literature of probability, and hence the reader should have some awareness 
and understanding of their content, significance and application. In fact, some of these theorems (such 
as the Bayes theorem) are fundamental and central to the probability theory at all levels and for a very
wide range of topics and applications, and hence detailed awareness and deep understanding of them are
essential for a comprehensive investigation of the probability theory. 


6.1 Bayes Theorem 


The Bayes theorem (which is about conditional probability and may also be called Bayes rule among other 
names and labels) is given by: 


\[
P(A|B) = \frac{P(A) \, P(B|A)}{P(B)} \qquad [P(B) \neq 0] \tag{149}
\]

where $P(A)$ is the prior probability of event $A$ (i.e. the probability of event $A$ before the occurrence of event $B$), $P(A|B)$ is the posterior probability of event $A$ (i.e. the probability of event $A$ after the occurrence of event $B$), $P(B)$ is the probability of event $B$ and $P(B|A)$ is the conditional probability of event $B$ given $A$. Bayes rule (i.e. Eq. 149) can be simply obtained from Eq. 47 (or rather it is a manipulated form of Eq. 47).


Problems



1. Prove the following equation: 


\[
P(B) = \sum_{i=1}^{n} P(A_i) \, P(B|A_i) \tag{150}
\]

where $A_i$ ($i = 1, \ldots, n$) are mutually exclusive random events whose union is the entire sample space, and $B$ is a random event in this sample space.
Answer: We proved this earlier (see part b of Problem 14 of § 3.3). In brief:

\[
P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(A_i) \, P(B|A_i)
\]

where step 1 is because the $A_i$'s are mutually exclusive and their union represents the entire sample space (and hence the intersections $B \cap A_i$ are mutually exclusive and represent all parts of $B$; see Eqs. 7 and 53), while the second step is from Eq. 46.
PE: It is claimed that Eq. 150 is an instance of the Bayes rule (i.e. Eq. 149) despite the fact that Eq. 150 is about unconditional probability [i.e. $P(B)$] while Eq. 149 is about conditional probability [i.e. $P(A|B)$]. Try to justify this claim.

2. Show that the Bayes theorem can be written as: 


\[
P(A|B) = \frac{P(A) \, P(B|A)}{\sum_{i=1}^{n} P(A_i) \, P(B|A_i)} \tag{151}
\]

\[
P(A|B) = \frac{P(A) \, P(B|A)}{P(A) \, P(B|A) + P(\bar{A}) \, P(B|\bar{A})} \tag{152}
\]

\[
\frac{P(A|B)}{P(C|B)} = \frac{P(A) \, P(B|A)}{P(C) \, P(B|C)} \tag{153}
\]

\[
\frac{P(B|A)}{P(B|C)} = \frac{P(C) \, P(A|B)}{P(A) \, P(C|B)} \tag{154}
\]

where $A, B, C$ are events in the sample space and the $A_i$'s (in Eq. 151) are mutually exclusive events whose union is the sample space.[137]

Answer:
• Eq. 151 is the same as Eq. 149 with $P(B)$ given by Eq. 150.
• Eq. 152 is a special case of Eq. 151 (where the sample space is divided into $A$ and $\bar{A}$).
• Eq. 153 is obtained by dividing $P(A|B)$ by $P(C|B)$ where these probabilities are obtained from Eq. 149.
• Eq. 154 is obtained by multiplying the two sides of Eq. 153 by $P(C)/P(A)$.
Note: each one of the above forms of the Bayes theorem (as well as the original form of Eq. 149) has its own uses and applications where a given form is either necessary or more appropriate to use depending on the cases and circumstances (which are determined, for instance, by the available data and information or by convenience).
PE: Give some examples where Eqs. 151-154 are necessary or convenient to use instead of Eq. 149.
3. The Monty Hall problem (which is controversial and may be classified as a paradox) is regarded as an instance for the application of Bayes rule. The problem is stated as follows (which we quote):
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice? (End of quote)
Solve the Monty Hall problem by applying the Bayes theorem.
Answer: Let us adopt the following:
• $A$ is the event that “the contestant has chosen the car door”.
• $B$ is the event that “the host reveals a goat door”.
• $P(A|B)$ is the (posterior) probability that the contestant has chosen the car door given that the host reveals a goat door. This is the probability that we want to find.
• $P(A)$ is the (prior) probability that the contestant has chosen the car door. This probability is obviously 1/3 since all three doors are equally likely to be the car door.
• $P(B|A)$ is the (conditional) probability that the host reveals a goat door given that the contestant has chosen the car door. This probability is obviously 1 since by the rules of the game the host cannot reveal the car door in any circumstance and under any condition during the game.
• $P(B)$ is the probability that the host reveals a goat door. Again, this probability is obviously 1 since the host cannot reveal the car door.
So, from Eq. 149 we get:

\[
P(A|B) = \frac{P(A) \, P(B|A)}{P(B)} = \frac{(1/3) \times 1}{1} = \frac{1}{3}
\]



This means that the contestant has a 1/3 chance of winning the car if he keeps his choice, and hence he 
has a 2/3 chance of winning the car if he switches his choice (noting that after the host reveals a goat 
door “keeping his choice” and “switching his choice” are complementary). So, it is to the advantage of 
the contestant to switch his choice. 

PE: Although the theory of probability is generally based on common sense, the result of the Monty Hall problem may not seem (to some) consistent with common sense.[138] Try to explain and justify this by investigating what is weird about this result.


[137] We note that $A$ in Eq. 151 is not the union of the $A_i$'s.
[138] In my view, the result if not intuitive is at least not counter-intuitive (although this judgment may be based on experience rather than instinctive intuition and common sense).

Table 7: The table of Problem 4 of § 6.1. DoC means “Door of Car” and CoC means “Choice of Contestant”.

Left sub-table (door opened by the host):

            DoC 1     DoC 2     DoC 3
    CoC 1   2 or 3    3         2
    CoC 2   3         1 or 3    1
    CoC 3   2         1         1 or 2

Right sub-table (result if the contestant switches):

            DoC 1     DoC 2     DoC 3
    CoC 1   ✗         ✓         ✓
    CoC 2   ✓         ✗         ✓
    CoC 3   ✓         ✓         ✗




4. Solve the Monty Hall problem (see Problem 3) by simple count. 
Answer: In Table 7 we have two sub-tables. The columns in both these sub-tables represent the 
number of the door behind which the car is located, while the rows in these sub-tables represent the 
choice of the contestant about the number of the door. 
The entries inside the cells of the left sub-table represent the choice of the host (i.e. the door which he 
opens after the initial choice of the contestant). For example, the cell in column 1 and row 1 means 
that if the car is behind door 1 and the contestant chooses door 1 then the host will open either door 
2 or door 3, while the cell in column 3 and row 2 means that if the car is behind door 3 and the 
contestant chooses door 2 then the host will open door 1. 
The entries inside the cells of the right sub-table represent the result of the game show if the contestant
switches his choice where ✓ means win (for the contestant) and ✗ means loss. As we see, there are 6
possibilities (out of 9) for the win and hence we can conclude that the contestant has a 2/3 chance of
win if he changes his initial choice. So, it is to the advantage of the contestant to switch his choice.
PE: Demonstrate and justify the patterns in Table 7. 
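The count described above can also be done by a short enumeration. The C++ sketch below (an illustration only; `count_switch_wins` is a hypothetical name) loops over the 9 equally likely combinations of car door and contestant choice and counts those in which switching wins. When the contestant has chosen the car door the host has two goat doors to open; the sketch opens the lower-numbered one, which does not affect the count because switching loses in either case.

```cpp
// Counts, over the 9 (car door, contestant choice) pairs, how many are wins
// for a contestant who always switches. Doors are numbered 1, 2, 3.
int count_switch_wins() {
    int wins = 0;
    for (int car = 1; car <= 3; ++car) {
        for (int choice = 1; choice <= 3; ++choice) {
            // The host opens a goat door that is not the contestant's choice.
            int host = 1;
            while (host == car || host == choice) ++host;
            // Switching means taking the remaining door; since the door
            // numbers sum to 1 + 2 + 3 = 6 the remaining door is:
            const int switched = 6 - choice - host;
            if (switched == car) ++wins;
        }
    }
    return wins;
}
```

The function should return 6, in agreement with the 6 wins (out of 9) reported above.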

5. Solve the Monty Hall problem (see Problem 3) by simulation. 
Answer: We solved the Monty Hall problem by simulation using C++ programming language (see 
MontyHall.cpp code). The results are similar to the results of the previous Problems. The simulation 
result improves as the number of runs increases and it converges to the theoretical result of Problem 
3 as the number of runs becomes large. 
PE: Plot a flowchart representing the algorithm of the MontyHall.cpp code. 
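For readers who do not have the MontyHall.cpp code at hand, the following is a minimal independent sketch of such a simulation (it is not the MontyHall.cpp code itself and may differ from it in structure and detail). The function name `simulate_switch_win_rate` and its parameters `runs` and `seed` are hypothetical choices made for this illustration.

```cpp
#include <random>

// Estimates the probability of winning the car for a contestant who always
// switches, by playing the game `runs` times. Doors are numbered 1, 2, 3.
double simulate_switch_win_rate(long runs, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> door(1, 3);
    std::uniform_int_distribution<int> coin(0, 1);
    long wins = 0;
    for (long r = 0; r < runs; ++r) {
        const int car = door(gen);     // door hiding the car
        const int choice = door(gen);  // initial choice of the contestant
        int host;                      // goat door opened by the host
        if (car == choice) {
            // Both other doors hide goats; the host opens one at random.
            const int first = (choice % 3) + 1;
            const int second = (first % 3) + 1;
            host = coin(gen) ? first : second;
        } else {
            // Only one door is neither the car door nor the chosen door.
            host = 6 - car - choice;
        }
        const int switched = 6 - choice - host;  // the remaining closed door
        if (switched == car) ++wins;
    }
    return static_cast<double>(wins) / runs;
}
```

A call such as `simulate_switch_win_rate(200000, 12345u)` should return a value close to 2/3, and the agreement is expected to improve as the number of runs increases.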

6. Make an argument in support of the result obtained in the previous Problems about the Monty Hall 
problem. 
Answer: We present in the following an argument in support of the result obtained in the previous 
Problems: 
Let identify the three doors as a, 3 and y. When the contestant chooses a particular door (say door 
a) he has a 1/3 chance of winning. This means that there is a 2/3 chance that the car is behind 6 or 
y. Now, the host cannot choose door a (because it is already chosen by the contestant) and hence the 
host has no access to this door. This means that the host has no ability to change its initial probability 
of 1/3 because the probability of door a can be changed only if it is accessible by the intervention of 
the host and can be affected by it. So, when the host chooses one of the two remaining doors (i.e. 3 
or Y) he will certainly not choose the door of the car and he cannot choose door a. Accordingly, when 
the host opens a door (which is certainly a goat door) the entire 2/3 probability of the two remaining 
doors will transfer to the door that is not chosen by the contestant or by the host. This means that 
the door chosen neither by the contestant nor by the host has a 2/3 probability of being the car door, 
and therefore it is to the advantage of the contestant to switch his choice. 
To sum up, the prior probability P(A) is 1/3 and there is no reason for this probability to change after 
the host reveals a goat door, i.e. P(A|B) = P(A). This means that the contestant has a 1/3 chance 
of winning if he keeps his original choice and a 2/3 chance of winning if he changes his original choice 
and hence it is to his advantage to switch his choice. 
PE: Assess the argument given in the answer of this Problem, pointing out any potential challenges 
to this argument. Also, make (if you can) another argument in support of the result obtained in the 
previous Problems or (at least) improve the argument given in the answer of this Problem. 

7. Let us have 3 urns $u_1, u_2, u_3$ each of which contains 5 balls. The balls of $u_1$ are all white, the balls of 
$u_2$ are 4 white and 1 black, and the balls of $u_3$ are 3 white and 2 black. One of these urns is selected 
randomly (without identifying which urn it is) and 3 balls are drawn from it where they are found to be 
all white. What is the probability that the selected urn is $u_1$? 

Answer: This Problem can obviously be solved by Bayes theorem. So, let us adopt the following: 

• $A_1$ is the event that “the selected urn is $u_1$”. 

• $B$ is the event that “the 3 drawn balls are all white”. 

• $P(A_1|B)$ is the (posterior) probability that the selected urn is $u_1$ given that the 3 drawn balls are 
all white. This is the probability that we want to find. 

• $P(A_1)$ is the (prior) probability that the selected urn is $u_1$. This probability is obviously 1/3 since 
all three urns are equally likely to be selected (noting that the selection is random). 

• $P(B|A_1)$ is the (conditional) probability that the 3 drawn balls are all white given that the selected 
urn is $u_1$. This probability is obviously 1 since $u_1$ contains only white balls. 

• $P(B)$ is the probability that the 3 drawn balls are all white. This probability is not obvious and hence 
we need to calculate it. However, it should be obvious that $P(B)$ represents the number of possibilities 
of “drawing 3 white balls” from all the three urns divided by the total number of possibilities of “drawing 
3 balls” from all the three urns (this counting is legitimate here because the urns are equally likely and 
each urn offers the same number of equally likely draws). Now, the number of possibilities of “drawing 
3 white balls” from all the three urns is (noting that $C^5_3, C^4_3, C^3_3$ correspond respectively to $u_1, u_2, u_3$): 

$$C^5_3 + C^4_3 + C^3_3 = 10 + 4 + 1 = 15$$

while the total number of possibilities of “drawing 3 balls” from all the three urns is: 

$$C^5_3 + C^5_3 + C^5_3 = 10 + 10 + 10 = 30$$

Accordingly, $P(B) = 15/30 = 1/2$. 

So, from Eq. 149 (with $A_1$ here corresponding to $A$ in Eq. 149) we get: 

$$P(A_1|B) = \frac{P(A_1)\,P(B|A_1)}{P(B)} = \frac{(1/3)\times 1}{1/2} = \frac{2}{3}$$


PE: Repeat the Problem assuming this time that each of $u_1, u_2, u_3$ contains 6 balls where the balls of 
$u_1$ are all white, the balls of $u_2$ are 5 white and 1 black, and the balls of $u_3$ are 4 white and 2 black. 

8. Referring to Problem 7, what is the probability that the selected urn is (a) $u_2$ (b) $u_3$? 

Answer: 

(a) If we repeat the argument of Problem 7 [noting that in this case we use $A_2$ to represent the event 
that “the selected urn is $u_2$” and hence $P(B|A_2) = C^4_3/C^5_3 = 4/10$], then from Eq. 149 (with $A_2$ here 
corresponding to $A$ in Eq. 149) we get: 

$$P(A_2|B) = \frac{P(A_2)\,P(B|A_2)}{P(B)} = \frac{(1/3)\times(4/10)}{1/2} = \frac{4}{15}$$

(b) If we repeat the argument of Problem 7 [noting that in this case we use $A_3$ to represent the event 
that “the selected urn is $u_3$” and hence $P(B|A_3) = C^3_3/C^5_3 = 1/10$], then from Eq. 149 (with $A_3$ here 
corresponding to $A$ in Eq. 149) we get: 

$$P(A_3|B) = \frac{P(A_3)\,P(B|A_3)}{P(B)} = \frac{(1/3)\times(1/10)}{1/2} = \frac{1}{15}$$


We may also obtain this probability more simply from the previous probabilities, i.e. $1 - (2/3) - (4/15) = 
1/15$ [noting that $(A_1|B)$, $(A_2|B)$, $(A_3|B)$ are mutually exclusive and exhaustive events]. 
PE: Repeat the Problem for the PE of Problem 7. 
9. Confirm the results of Problems 7 and 8 by simulation. 

Answer: We simulated those Problems using C++ programming language (see Urn3W1.cpp code for 
Problem 7 and Urn3W2.cpp and Urn3W3.cpp codes for Problem 8). The results are similar to the 
results of those Problems. The simulation results improve as the number of runs increases and they 
converge to the theoretical results of Problems 7 and 8 as the number of runs becomes large. 

PE: Try to modify the Urn3W1.cpp, Urn3W2.cpp and Urn3W3.cpp codes for the PE of Problems 7 
and 8. 

10. Referring to Problem 7, if we draw a fourth ball (following the drawing of 3 white balls) what is the 
probability that this ball is (a) white (b) black? 

Answer: 

(a) Let us adopt the following: 

• $A_{u_i}$ ($i = 1,2,3$) is the event that “the fourth ball drawn from urn $u_i$ is white” (i.e. following the 
drawing of 3 white balls from the selected urn). 

• $B_{u_i}$ ($i = 1,2,3$) is the event that “the selected urn is $u_i$” (i.e. following the drawing of 3 white balls 
from the selected urn). 

The required probability $P_w$ is the sum (over $i = 1,2,3$) of the (exhaustive) probabilities of “getting a 
fourth white ball from urn $u_i$ given that 3 white balls were drawn from the selected urn”. The sum is 
justified by the fact that these events (i.e. “getting a fourth ... selected urn”) are members of a union 
of disjoint events (because they correspond to different urns) and hence their probabilities are subject 
to the addition law of probability for mutually exclusive events (see Eq. 53). 

Now, each one of these three probabilities is the product of the probability of “the fourth ball drawn 
from urn $u_i$ is white given that the selected urn is $u_i$” times the probability that “the selected urn is 
$u_i$”. The product is justified by the fact that we are looking for the probability of intersection of $A_{u_i}$ 
and $B_{u_i}$ (since we are looking for “$A_{u_i}$ AND $B_{u_i}$”) and hence it is subject to the multiplication law for 
associated events (see Eq. 46). Accordingly, we get:[139] 

$$
\begin{aligned}
P_w &= \sum_{i=1}^{3} P(A_{u_i} \cap B_{u_i}) = \sum_{i=1}^{3} P(A_{u_i}|B_{u_i})\,P(B_{u_i}) \\
&= P(A_{u_1}|B_{u_1})\,P(B_{u_1}) + P(A_{u_2}|B_{u_2})\,P(B_{u_2}) + P(A_{u_3}|B_{u_3})\,P(B_{u_3}) \\
&= \left(1\times\frac{2}{3}\right) + \left(\frac{1}{2}\times\frac{4}{15}\right) + \left(0\times\frac{1}{15}\right) = \frac{12}{15} = \frac{4}{5}
\end{aligned}
$$

(b) The events of “getting white” and “getting black” in the fourth draw are complementary and hence 
the required probability is $P_b = 1 - P_w = 1 - (4/5) = 1/5$. To check, we follow the method of part a 
(with replacement of “white” by “black”), that is: 

$$P_b = \left(0\times\frac{2}{3}\right) + \left(\frac{1}{2}\times\frac{4}{15}\right) + \left(1\times\frac{1}{15}\right) = \frac{1}{5}$$


PE: Repeat the Problem for the PE of Problem 7. 

11. Resolve Problem 10 by using Eq. 150. 

Answer: 

(a) Let the symbols of Eq. 150 represent the following: 

• $A_i$ ($i = 1,2,3$) is the event that “the selected urn is $u_i$” (i.e. following the drawing of 3 white balls 
from the selected urn). 

• $B$ is the event that “the fourth ball is white” (i.e. following the drawing of 3 white balls from the 
selected urn). 


[139] We note that $P(A_{u_1}|B_{u_1}) = 1$ because all the balls of $u_1$ are white, $P(A_{u_2}|B_{u_2}) = 1/2$ because after drawing three 
white balls $u_2$ contains only one black ball and one white ball, and $P(A_{u_3}|B_{u_3}) = 0$ because after drawing three white 
balls $u_3$ contains only two black balls. Regarding $P(B_{u_i})$ ($i = 1,2,3$) they were calculated in Problems 7 and 8, i.e. 
$P(B_{u_i}) = P(A_i|B)$. 
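Now, from the results of Problems 7 and 8 we have: 

$$P(A_1) = \frac{2}{3} \qquad P(A_2) = \frac{4}{15} \qquad P(A_3) = \frac{1}{15}$$

Moreover, from the discussion of Problem 10 we have: 

$$P(B|A_1) = 1 \qquad P(B|A_2) = \frac{1}{2} \qquad P(B|A_3) = 0$$

Hence, from Eq. 150 we get: 

$$P(B) = \sum_{i=1}^{3} P(A_i)\,P(B|A_i) = \left(\frac{2}{3}\times 1\right) + \left(\frac{4}{15}\times\frac{1}{2}\right) + \left(\frac{1}{15}\times 0\right) = \frac{4}{5}$$

(b) If $\bar{B}$ means “the fourth ball is black” (i.e. following the drawing of 3 white balls from the selected 
urn) then $P(\bar{B}) = 1 - P(B) = 1 - (4/5) = 1/5$. Alternatively: 

$$P(\bar{B}) = \sum_{i=1}^{3} P(A_i)\,P(\bar{B}|A_i) = \left(\frac{2}{3}\times 0\right) + \left(\frac{4}{15}\times\frac{1}{2}\right) + \left(\frac{1}{15}\times 1\right) = \frac{1}{5}$$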
PE: Compare the method of solution of the present Problem to the method of solution of Problem 10 
and comment. 

12. Confirm the results of Problem 10 by simulation. 

Answer: We simulated that Problem using C++ programming language (see Urn4thW.cpp and 
Urn4thB.cpp codes). The results are similar to the results of Problem 10. The simulation results 
improve as the number of runs increases and they converge to the theoretical results of Problem 10 as 
the number of runs becomes large. 

PE: Try to modify the Urn4thW.cpp and Urn4thB.cpp codes for the PE of Problem 10. 

13. We have two boxes: $b_1$ and $b_2$ where $b_1$ contains $r_1$ red cards and $g_1$ green cards while $b_2$ contains $r_2$ 
red cards and $g_2$ green cards. We draw a card randomly from $b_1$ and put it in $b_2$. We then draw a 
card randomly from $b_2$. What is the probability that the card drawn is red? 

Answer: This Problem can obviously be solved by Bayes theorem. In fact, we will use Eq. 150. So, let us 
adopt the following: $A_1, A_2, R$ are the events (correspondingly) of drawing a red card from $b_1$, drawing 
a green card from $b_1$, and drawing a red card from $b_2$. So, what we want to find is $P(R)$. Now: 

$$P(A_1) = \frac{r_1}{r_1+g_1} \qquad P(A_2) = \frac{g_1}{r_1+g_1} \qquad P(R|A_1) = \frac{r_2+1}{r_2+g_2+1} \qquad P(R|A_2) = \frac{r_2}{r_2+g_2+1}$$

Hence, from Eq. 150 we get: 

$$
\begin{aligned}
P(R) &= \sum_{i=1}^{2} P(A_i)\,P(R|A_i) = P(A_1)\,P(R|A_1) + P(A_2)\,P(R|A_2) \\
&= \left(\frac{r_1}{r_1+g_1}\right)\left(\frac{r_2+1}{r_2+g_2+1}\right) + \left(\frac{g_1}{r_1+g_1}\right)\left(\frac{r_2}{r_2+g_2+1}\right) = \frac{r_1 + r_1 r_2 + g_1 r_2}{(r_1+g_1)(r_2+g_2+1)}
\end{aligned}
$$


PE: Justify the use of Eq. 150. 


6.2 Limit Theorems 


There are many limit theorems related to probability and probability distributions (as well as associated 
random variables and their parameters) which are frequently used in derivations and calculations in the 
probability theory and related subjects. Some of these theorems have been met earlier in this book. In 
the following subsections we briefly discuss some of the most common of these theorems. 
6.2.1 Stirling Formula Theorem 


This is related to the approximation of the factorial by the Stirling formula which we discussed earlier 
(see § 2.3.2). As a limit theorem, this formula can be written as: 


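$$\lim_{n\to\infty} \frac{n!}{\sqrt{2\pi n}\; n^n\, e^{-n}} = 1 \qquad \left(\text{i.e. } n! \simeq \sqrt{2\pi n}\; n^n\, e^{-n} \text{ for large } n\right) \tag{155}$$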


6.2.2 Theorems about Convergence of Distributions 


There are many theorems related to the convergence of some probability distributions to other probability 
distributions under certain conditions and in special circumstances (where the converging and converged- 
to distributions could be both discrete or both continuous or mixed). These limit theorems include for 
instance: 

• The convergence of the binomial distribution to the Poisson distribution under certain conditions (see 
Problem 1). 

• The convergence of the binomial distribution to the normal distribution under certain conditions (see 
Problem 1). 

• The convergence of the Poisson distribution to the normal distribution under certain conditions (see 
Problem 1). 

• The convergence of the hypergeometric distribution to the binomial distribution under certain conditions 
(see Problem 5 of § 4.1.6). 

These theorems (and similar theorems) were investigated or indicated earlier in various places of chapter 
4 and hence they do not require further investigation. In fact, there are other limit theorems about the 
convergence of some distributions to other distributions which we did not mention or investigate (e.g. the 
convergence of the negative binomial and hypergeometric distributions to the Poisson distribution under 
certain conditions). We may also include (loosely and in a rather different sense) in this type of limit 
theorems the propositions about the special cases (e.g. “The Bernoulli distribution is a special case of 
the binomial distribution corresponding to n = 1” or “The geometric distribution is a special case of the 
negative binomial distribution corresponding to r = 1”). 


Problems 


1. Outline the “convergence theorems”[140] of the binomial, Poisson and normal probability distributions. 
Answer: 
• Binomial to Poisson: as $n$ tends to infinity and $p$ tends to zero (with $\mu = np$ remaining finite and 
constant), the binomial distribution converges to the corresponding Poisson distribution (with $\mu = \lambda$). 
Also see Problems 2 and 3. 
• Binomial to normal: as $n$ tends to infinity and $p$ remains finite (so $\mu = np$ becomes very large), 
the binomial distribution converges to the corresponding normal distribution [with $x = k$, $\mu = np$ and 
$V = np(1-p)$]. 
• Poisson to normal: for large $k$ and $\lambda$ the Poisson distribution converges to the corresponding 
normal distribution (with $x = k$ and $\mu = V = \lambda$). 
Note: the last statement (i.e. about the convergence of Poisson to normal) is what we found in the 
literature (noting that some may not impose the condition of large $\lambda$). However, from the comparison 
of Problem 10 of § 4.2.2 we can see that these conditions are not sufficient (and may not even be 
necessary in some cases). In our view, if the Poisson distribution should converge to the normal 
distribution (considering that both can be seen as limits to the binomial, as outlined in the first two 
theorems, and hence we can take the binomial as a reference for their convergence) then we should 
have $\mu = \lambda \simeq np$ and $V = \lambda \simeq np(1-p)$ which on comparison should lead to $1 - p \simeq 1$ which means 
that $p$ must be small. In fact, the result of Problem 10 of § 4.2.2 (as seen in Figure 26) should 
support this condition. We should also remember (see Problem 1 and § 4.1.4) that for the Poisson 


[140] The “convergence theorems” is a term that we use to label these theorems (which may also be labeled by some as “the 
laws of large numbers”). So, this is not a standard term and hence it should not be confused with similar terms found 
in the literature. 
distribution to be a limiting case (and hence a good approximation) to the binomial distribution we 
should have $p \to 0$ (as well as $n \to \infty$) and hence the condition of small $p$ can also be obtained from 
the condition of $p \to 0$. Also see Problems 3 and 4. 

Anyway, the reader should be aware that there is some mess and lack of clarity (as well as potential 
contradiction and lack of sufficient details) about some of the limit theorems related to the convergence 
of some distributions to other distributions, and hence the reader should be generally cautious about 
this issue. In fact, from our personal experience we found many cases in which some of these theorems 
fail in certain circumstances, and this should indicate that the given conditions are inaccurate or 
insufficient or not general. 

PE: What can you conclude from the above convergence theorems? 

2. Show that the binomial distribution converges to the Poisson distribution when the number of trials 

n becomes large and hence the probability of occurrence p becomes small (assuming A = np remains 
constant). 
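Answer: For the binomial distribution $\mu = np$ and for the Poisson distribution $\mu = \lambda$. Hence, if we 
have to compare these distributions sensibly (by assuming that they give similar results according to 
the requirement of convergence) then we should take $\lambda = np$ and hence $p = \lambda/n$.[141] Now, if we write 
Eq. 71 (of binomial) in terms of this expression of $p$ then we have: 

$$
\begin{aligned}
P(k) &= C^n_k\, p^k (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!} \left(\frac{\lambda}{n}\right)^{k} \left(1-\frac{\lambda}{n}\right)^{n-k} \\
&= \frac{n\times(n-1)\times\cdots\times(n-k+1)}{n^k}\; \frac{\lambda^k}{k!} \left(1-\frac{\lambda}{n}\right)^{n-k} \\
&= \left[\frac{n}{n}\times\frac{n-1}{n}\times\cdots\times\frac{n-k+1}{n}\right] \frac{\lambda^k}{k!} \left(1-\frac{\lambda}{n}\right)^{n-k} \\
&= \left[\frac{n}{n}\times\frac{n-1}{n}\times\cdots\times\frac{n-k+1}{n}\right] \frac{\lambda^k}{k!} \left(1-\frac{\lambda}{n}\right)^{n} \left(1-\frac{\lambda}{n}\right)^{-k}
\end{aligned}
$$

Now, if $n$ becomes large (with $k$ fixed) then all the fractions inside the square brackets (in the last 
line) tend to unity, and this also applies to the last factor (noting that $\lambda$ is fixed).[142] Accordingly, 
we get: 

$$P(k) \simeq \frac{\lambda^k}{k!} \left(1-\frac{\lambda}{n}\right)^{n} \simeq \frac{\lambda^k}{k!}\, e^{-\lambda} \qquad (\text{large } n)$$

where the second step is based on a well known result from calculus.[143] As we see, this is the same 
as Eq. 75. 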

Note: the statement in this Problem may be given more rigorously as follows: when $n$ approaches 
infinity and $p$ approaches 0 such that $np$ remains finite and constant, the binomial distribution 
(parameterized by $n$ and $p$) converges to the corresponding Poisson distribution (parameterized by $\lambda = \mu = np$). 
In the following we prove this statement (more rigorously) using the technique of limits. 

From calculus we have (with k kept finite): 


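$$
\begin{aligned}
\lim_{n\to\infty} \left[\frac{n!}{(n-k)!\; n^k}\right] &= \lim_{n\to\infty} \frac{n\times(n-1)\times\cdots\times(n-k+1)}{n^k} \\
&= \lim_{n\to\infty} \left(\frac{n}{n}\times\frac{n-1}{n}\times\cdots\times\frac{n-k+1}{n}\right)
\end{aligned}
$$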

[141] In fact, we should also consider the condition $\lambda = np(1-p)$ which we indicated earlier. However, the condition $p \to 0$ 
should ensure that $(1-p)$ tends to unity (although this applies to the limit but not necessarily to cases of approximation). 
[142] We note that the condition “the probability of occurrence p becomes small” (which we stated above) is considered 
implicitly in $p = \lambda/n$ noting that $\lambda$ is supposedly fixed. 
[143] We refer to the following relation: 
$$e^{x} = \lim_{n\to\infty} \left(1+\frac{x}{n}\right)^{n}$$
Table 8: The table of Problem 3 of § 6.2.2. 

   n |  k |    p |  λ | Binomial | Poisson | % Difference 
  50 |  1 | 0.02 |  1 |   0.3716 |  0.3679 | 1.0 
 100 |  2 | 0.01 |  1 |   0.1849 |  0.1839 | 0.5 
 150 |  6 | 0.02 |  3 |   0.0499 |  0.0504 | 1.0 
 200 |  7 | 0.05 | 10 |   0.0896 |  0.0901 | 0.6 
 300 |  5 | 0.02 |  6 |   0.1617 |  0.1606 | 0.7 
1000 | 14 | 0.01 | 10 |   0.0520 |  0.0521 | 0.1 


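$$= \lim_{n\to\infty}\left(\frac{n}{n}\right) \times \lim_{n\to\infty}\left(\frac{n-1}{n}\right) \times\cdots\times \lim_{n\to\infty}\left(\frac{n-k+1}{n}\right) = 1\times 1\times\cdots\times 1 = 1$$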


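Moreover, if we note that $p = \lambda/n$ (since $\lambda = np$) and $k$ is finite then we have (using calculus again): 

$$\lim_{n\to\infty} (1-p)^{n-k} = \lim_{n\to\infty} \left(1-\frac{\lambda}{n}\right)^{n-k} = \lim_{n\to\infty} \left(1-\frac{\lambda}{n}\right)^{n} \left(1-\frac{\lambda}{n}\right)^{-k} = e^{-\lambda}\times 1 = e^{-\lambda}$$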


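Now, if we substitute these results into the equation of the binomial distribution (see Eq. 71) considering 
the above assumptions we get: 

$$
\begin{aligned}
P(k) &= C^n_k\, p^k (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} = \frac{n!}{(n-k)!\; n^k}\; \frac{n^k p^k}{k!}\, (1-p)^{n-k} \\
&= \frac{n!}{(n-k)!\; n^k}\; \frac{\lambda^k}{k!}\, (1-p)^{n-k} \;\to\; 1\times\frac{\lambda^k}{k!}\times e^{-\lambda} = \frac{\lambda^k e^{-\lambda}}{k!}
\end{aligned}
$$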


which is the same as the Poisson distribution (see Eq. 75). 
PE: Explain and justify each step in the above derivations. 

3. Calculate the Poisson probability corresponding to a number of binomial cases (where p of the binomial 
is small considering a number of values of n) and hence compare the two distributions showing that 
the Poisson distribution is a good approximation to the binomial distribution in these cases. 
Answer: We made such a comparison in Table 8. 

Note: the results of this Problem indicate that the approximation is good when p is small even when 
n is relatively small, and this is consistent in part with our observation in the note of Problem 1. 
PE: Do more comparisons using different values of n and larger values of k and p. Try to make some 
observations on how these changes affect the results. 

4. Make plots of the binomial distribution for n = 140 with p = 0.05, 0.20, 0.40, 0.60, 0.80, 0.95 and their 
corresponding Poisson distribution. 

Answer: See Figure 28. 

PE: Do the following: 

• Comment on the effect of varying $p$ on the quality of agreement between the binomial distribution 
and the corresponding Poisson distribution (considering the notes of Problems 1 and 3). 

• Investigate the effect of varying $p$ on properties other than the quality of agreement between the 
binomial distribution and the corresponding Poisson distribution. 

• Try to make similar plots for different $n$ and hence investigate the effect of varying $n$ on the quality 
of agreement between the binomial distribution and the corresponding Poisson distribution (as well as 
other potential effects). 
Figure 28: Comparison between the binomial distribution (solid curve) and the corresponding Poisson 
distribution (filled circles) for n = 140 with p = 0.05 (top left), p = 0.20 (top right), p = 0.40 (middle 
left), p = 0.60 (middle right), p = 0.80 (bottom left) and p = 0.95 (bottom right). For the corresponding 
Poisson distribution we have λ = np in each case. The horizontal axis in each frame represents k and the 
vertical axis represents the probability of the distribution P. See Problem 4 of § 6.2.2. 
6.2.3 Central Limit Theorem 


The central limit theorem can be seen as a generalization of other (more special) limit theorems (such 
as some of those indicated in § 6.2.2). According to this theorem if $X_1, X_2, \ldots, X_n$ are independent 
random variables represented by probability functions $f_1, f_2, \ldots, f_n$ with means $\mu_1, \mu_2, \ldots, \mu_n$ and variances 
$V_1, V_2, \ldots, V_n$ then the “mean” $Z$ of these random variables, i.e. 

$$Z = \frac{1}{n}\sum_{i=1}^{n} X_i \tag{156}$$

possesses the following properties: 
(a) The mean of $Z$ is the average value of $\mu_1, \mu_2, \ldots, \mu_n$, that is: 

$$\mu(Z) = \frac{1}{n}\sum_{i=1}^{n} \mu_i \tag{157}$$

(b) The variance of $Z$ is the average value of $V_1, V_2, \ldots, V_n$ divided by $n$, that is: 

$$V(Z) = \frac{1}{n}\left(\frac{1}{n}\sum_{i=1}^{n} V_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} V_i \tag{158}$$

(c) As $n$ goes to infinity, the probability distribution function of $Z$ [i.e. $g(Z)$] tends to a normal distribution 
with mean $\mu_Z = \mu(Z)$ and variance $V_Z = V(Z)$, that is: 

$$g(Z) \to \frac{1}{\sqrt{2\pi V_Z}}\; e^{-\frac{(Z-\mu_Z)^2}{2V_Z}} \tag{159}$$

In the following points we discuss briefly some issues related to the central limit theorem: 

• The central limit theorem is stated in the literature in various shapes and forms and with different flavors 
and terminologies (e.g. some represent probability approach and others represent statistics approach, and 
some are more general than others). So, the above statement is just one of these variants. In fact, there 
may even be differences in significance and content (i.e. we have central limit theorems). Therefore, those 
who have special interest in this theorem should determine the context and purpose of this theorem which 
they want so that they choose the statement that is more suitable to their objectives. 
• It is obvious that if the central limit theorem should apply then the probability functions $f_1, f_2, \ldots, f_n$ 
must have means and variances (and hence pathological distributions, such as the Cauchy distribution, 
should be excluded from the domain of this theorem). 
• The central limit theorem should explain (in part) what may have been noticed (or guessed) previously 
that probability distributions (such as binomial) generally converge to the normal distribution when $n$ 
becomes large (within certain conditions). The details of this should be sought in the literature. 


6.2.4 Large Numbers Theorem 


The essence of this theorem (which is commonly known as the law of large numbers) is that as the sample 
size of a random variable increases, its mean becomes closer to the mean of the entire population.[144] 
This should sound reasonable because as the sample increases in size it becomes more representative of 
the population and hence its mean approaches the mean of the population. It is noteworthy that the 
Bernoulli theorem is a special case of the law of large numbers (see Problem 1). In fact, there are many 
details and theorems related to the law of large numbers,[145] so the interested reader should consult the 
literature about this issue. 


[144] We are assuming the sample is not biased. 

[145] Accordingly, we can say we have “large numbers theorems”, i.e. plural. In fact, even the theorem of Stirling (see § 
6.2.1) and the theorems of convergence (see § 6.2.2) and their alike may be classified in this category and hence they are 
labeled as “large numbers theorems”. However, the reader should be aware of the difference in meaning between these 
labels (as well as the difference in significance and essence between the intended theorems) to avoid confusion. 
Problems 


1. Outline the Bernoulli theorem. 
Answer: According to the Bernoulli theorem, the relative frequency of a given outcome (say 
success) in a number of Bernoulli trials tends to the probability of that outcome (i.e. success) as 
the number of trials tends to infinity. For example, in a sequence of (fair) coin-tossing trials the 
probability of H (i.e. getting head) is 1/2, but in a limited number of trials the relative frequency 
does not necessarily reflect this probability. So, if we toss the coin 4 times we may get H three times 
(representing a relative frequency of 0.75) rather than the “expected” two times (which is the frequency 
of occurrence that reflects the probability of 1/2 for getting H). However, as we increase the number of 
trials we will see that the relative frequency fluctuates around this probability and converges towards 
it. So, in 100 trials we may get 45 H’s (with a relative frequency of 0.45 which is closer to 1/2 than 
0.75), and in 1000 trials we may get 511 H’s (with a relative frequency of 0.511 which is closer to 
1/2 than 0.75 and 0.45). As indicated above, the Bernoulli theorem is a special case (or an instance 
or an application) of the law of large numbers (depending on how it is stated). 
PE: The Bernoulli theorem may be stated in terms of the average of the outcomes (rather than the 
relative frequency as we did above): 
(a) State this theorem in terms of the average. 
(b) Express your statement (or our statement) of the theorem mathematically as a limit relation, i.e. 
$\lim_{n\to\infty} (\cdots) = (\cdots)$. 


6.3 Inequality Theorems 


These are probability theorems in the form of inequalities. Most of these theorems are widely used and 
have many applications in the probability theory and related fields of mathematics and science. In fact, 
there are many inequality theorems (noting that some of these theorems are not restricted to probability 
but they have versions or instances related to probability or tailored for it). In the following subsections 
we present a few of these inequality theorems. 


6.3.1 Markov Inequality 


According to this theorem, if $x$ is a random variable that takes only non-negative values and has a mean 
$\mu(x)$, then for any real number $c > 0$ we have: 

$$P(x \ge c) \le \frac{\mu(x)}{c} \tag{160}$$


For example, if $x$ is a random variable that meets the above conditions then we know from this theorem 
without doing any substantial calculations (i.e. by just setting $c = 3\mu$ in the above equation) that the 
probability of $x$ being not less than three times its mean is not larger than 1/3. In fact, this theorem is 
very handy in many practical situations where quick estimates of probability and the limits on its value 
are required. The theorem is also useful in many theoretical arguments and analytical derivations related 
to probability (see for instance Problem 1 of § 6.3.3). 


Problems 


1. Prove the Markov inequality (considering the discrete case). 
Answer: We have: 


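$$
\begin{aligned}
\mu(x) &= \sum_{i} p_i x_i && \text{(Eq. 115)} \\
&= \sum_{x_i < c} p_i x_i + \sum_{x_i \ge c} p_i x_i \\
&\ge \sum_{x_i \ge c} p_i x_i && \text{(1st sum} \ge 0 \text{ since } x_i \text{ are non-negative)} \\
&\ge \sum_{x_i \ge c} c\, p_i && (x_i \ge c \text{ in this sum)} \\
&= c \sum_{x_i \ge c} p_i && (c \text{ is constant)} \\
&= c\, P(x \ge c) && \text{(addition rule for disjoint events)} \\
\frac{\mu(x)}{c} &\ge P(x \ge c) && (c > 0)
\end{aligned}
$$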


PE: Prove the Markov inequality (considering the continuous case). 
2. Show that the result of part (a) of Problem 2 of § 5.1 is consistent with the Markov inequality. 
Answer: From the left hand side of Eq. 160 we have (with reference to Problem 2 of § 5.1): 

$$P(x \ge c) = \int_{c}^{4} \frac{x}{8}\, dx = \left[\frac{x^2}{16}\right]_{c}^{4} = \frac{16 - c^2}{16} \qquad (0 < c \le 4)$$

while from the right hand side of Eq. 160 we have (with reference to Problem 2 of § 5.1): 

$$\frac{\mu(x)}{c} = \frac{8}{3c} \qquad (0 < c \le 4)$$

Now, it is straightforward to show (e.g. by calculus or by a plot using for instance a spreadsheet) that 
$\frac{16 - c^2}{16} - \frac{8}{3c}$ is non-positive for $0 < c \le 4$ and hence $P(x \ge c) \le \frac{\mu(x)}{c}$. 
PE: Repeat the Problem for part (a) of the PE of Problem 2 of § 5.1. 


6.3.2 Chebyshev Inequality 


The essence of the Chebyshev inequality is that if $x$ is a random variable with mean $\mu$ and standard 
deviation $\sigma$ and $k$ is a positive real number ($> 1$)[146] then the probability that $x$ differs from $\mu$ by more 
than $k$ standard deviations is $\le 1/k^2$, that is: 

$$P(|x - \mu| \ge k\sigma) \le \frac{1}{k^2} \tag{161}$$

For example, the probability that $x$ deviates from $\mu$ by more than $2\sigma$ is $\le 1/4$ (corresponding to $k = 2$). 
Problems 


1. Prove the Chebyshev inequality (considering the discrete case). 
Answer: From Eq. 137 (noting that $V = \sigma^2$ and $\mu_x = \mu$) we have: 

$$\sigma^2 = \sum_{i} (x_i - \mu)^2\, p_i \tag{162}$$

For the values of $x$ for which $|x - \mu| \ge k\sigma$ the sum in the last equation should exclude (in general) 
some terms and hence we can write: 

$$\sigma^2 \ge \sum_{j} (x_j - \mu)^2\, p_j$$

where $j$ refers only to the terms (of Eq. 162) for which $|x - \mu| \ge k\sigma$. Now, if we replace $(x_j - \mu)^2$ in the 
last equation by $(k\sigma)^2$ (noting that the last equation includes only the terms for which $|x - \mu| \ge k\sigma$ 
and hence the inequality is not affected by this replacement) then we can write: 

$$\sigma^2 \ge \sum_{j} (k\sigma)^2\, p_j$$


[146] It is obvious that the Chebyshev inequality is true trivially for $0 < k \le 1$ (since any probability must be $\le 1$). 
Noting that $p_j$ in the last equation represent the probabilities corresponding to $|x - \mu| \ge k\sigma$ we can 
rewrite the last equation as: 

$$\frac{1}{k^2} \ge P(|x - \mu| \ge k\sigma)$$

which is the Chebyshev inequality. 

PE: Prove the Chebyshev inequality (considering the continuous case). 
2. Show that the results of Problem 6 of § 4.2.2 are consistent with the Chebyshev inequality. 

Answer: 

(a) For $k = 1$ we have[147] (according to the Chebyshev inequality) $P(|x - \mu| \ge \sigma) \le 1$, and from 
part (a) of Problem 6 of § 4.2.2 we have $P(|x - \mu| \ge \sigma) \simeq 1 - 0.6827 = 0.3173$, and hence the two 
results are consistent. 
(b) For $k = 2$ we have (according to the Chebyshev inequality) $P(|x - \mu| \ge 2\sigma) \le 0.25$, and from 
part (b) of Problem 6 of § 4.2.2 we have $P(|x - \mu| \ge 2\sigma) \simeq 1 - 0.9545 = 0.0455$, and hence the two 
results are consistent. 
(c) For $k = 3$ we have (according to the Chebyshev inequality) $P(|x - \mu| \ge 3\sigma) \le 1/9 \simeq 0.11$, and 
from part (c) of Problem 6 of § 4.2.2 we have $P(|x - \mu| \ge 3\sigma) \simeq 1 - 0.9973 = 0.0027$, and hence the 
two results are consistent. 
PE: Explain (in words) the logic of our arguments above, e.g. why $P(|x - \mu| \ge \sigma) \simeq 1 - 0.6827$. 

3. Show that the result of part (a) of Problem 2 of § 5.2 is consistent with the Chebyshev inequality. 
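Answer: From the left hand side of Eq. 161 we have (with reference to Problem 2 of § 5.2 as well as 
Problem 2 of § 5.1):[148] 

$$
\begin{aligned}
P(|x - \mu| \ge k\sigma) &= 1 - P(-k\sigma < x - \mu < k\sigma) = 1 - P(\mu - k\sigma < x < \mu + k\sigma) \\
&= 1 - \int_{\mu - k\sigma}^{\mu + k\sigma} \frac{x}{8}\, dx = 1 - \int_{(8/3) - k\sqrt{8/9}}^{(8/3) + k\sqrt{8/9}} \frac{x}{8}\, dx = \frac{9 - 4\sqrt{2}\,k}{9}
\end{aligned}
$$

It is straightforward to show (with no need for going through the details about the range of $k$) that 
$\frac{9 - 4\sqrt{2}\,k}{9} - \frac{1}{k^2}$ is non-positive for all positive $k$ and hence we conclude that $P(|x - \mu| \ge k\sigma) \le \frac{1}{k^2}$. 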


PE: Repeat the Problem for part (a) of the PE of Problem 2 of § 5.2 (with reference to Problem 2 of 
§ 5.1). 


6.3.3 One-Sided Chebyshev Inequality 


The one-sided Chebyshev inequality is also known as the Cantelli inequality and the Chebyshev-Cantelli 
inequality. According to this theorem, if $x$ is a random variable with mean $\mu$ and variance $V$ and $a$ is a 
positive real number then: 

$$P(x \ge \mu + a) \le \frac{V}{a^2 + V} \tag{163}$$


Problems 


[147] Noting the condition k > 1, we do this case for demonstration. 
[148] We also refer the reader to Problem 2 of § 4. 
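1. Prove the one-sided Chebyshev inequality (considering the continuous case). 
Answer: If $t$ is a real parameter such that $a + t > 0$ then we have: 

$$
\begin{aligned}
P(x \ge \mu + a) &= P(x - \mu \ge a) && \text{(see Problem 2 of § 4)} \\
&= P(x - \mu + t \ge a + t) && \text{(see Problem 2 of § 4)} \\
&= P\left(\frac{x - \mu + t}{a + t} \ge 1\right) && \text{(see Problem 2 of § 4)} \\
&\le P\left(\left[\frac{x - \mu + t}{a + t}\right]^2 \ge 1\right) && \text{(see upcoming note)} \\
&\le \mu\left(\left[\frac{x - \mu + t}{a + t}\right]^2\right) && \text{(Eq. 160)} \\
&= \frac{\mu\left[(x - \mu + t)^2\right]}{(a + t)^2} && \text{(Eq. 121)} \\
&= \frac{\mu\left[x^2 + 2(-\mu + t)x + (-\mu + t)^2\right]}{(a + t)^2} \\
&= \frac{\mu(x^2) + 2(-\mu + t)\,\mu(x) + (-\mu + t)^2}{(a + t)^2} && \text{(Eqs. 119-124)} \\
&= \frac{\mu(x^2) - 2\mu^2 + 2t\mu + \mu^2 - 2t\mu + t^2}{(a + t)^2} \\
&= \frac{\mu(x^2) - \mu^2 + t^2}{(a + t)^2} \\
&= \frac{V + t^2}{(a + t)^2} && \text{(Eq. 147)}
\end{aligned}
$$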


Now, the inequality $P(x \ge \mu + a) \le \frac{V + t^2}{(a + t)^2}$ which we already obtained should be valid for any $t$ as long 
as $a + t > 0$. However, as a minimization inequality (i.e. $\le$) it is appropriate to consider the extreme 
case of its validity which is when $\frac{V + t^2}{(a + t)^2}$ is minimum. To find the minimum of $\frac{V + t^2}{(a + t)^2}$ we treat $t$ as a 
variable and differentiate with respect to it, that is: 

$$\frac{d}{dt}\left[\frac{V + t^2}{(a + t)^2}\right] = \frac{2t(a + t)^2 - 2(a + t)(V + t^2)}{(a + t)^4} = \frac{2t(a + t) - 2(V + t^2)}{(a + t)^3} = \frac{2(at - V)}{(a + t)^3}$$

As we see, this derivative is zero (and thus $\frac{V + t^2}{(a + t)^2}$ is minimum considering other tests and 
conditions)[149] when $(at - V) = 0$, i.e. when $t = V/a$. On substituting this value of $t$ in $\frac{V + t^2}{(a + t)^2}$ we 
get: 

$$\left[\frac{V + t^2}{(a + t)^2}\right]_{\min} = \frac{V + (V/a)^2}{\left(a + (V/a)\right)^2} = \frac{a^2 V + V^2}{(a^2 + V)^2} = \frac{V(a^2 + V)}{(a^2 + V)^2} = \frac{V}{a^2 + V}$$

So, finally we get $P(x \ge \mu + a) \le \frac{V}{a^2 + V}$ which is the one-sided Chebyshev inequality. 


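Note: let $z = \frac{x - \mu + t}{a + t}$ have the density function $f(z)$ and $w(z) = z^2 = \left(\frac{x - \mu + t}{a + t}\right)^2$ have the density 
function $g(w)$. Referring to Eq. 63 in Problem 2 of § 4 we have: 

$$g(w) = f[z(w)] \times \left|\frac{dz}{dw}\right| = f(\sqrt{w}) \times \left|\frac{1}{\sqrt{4w}}\right| = f(\sqrt{w}) \times \frac{1}{\sqrt{4w}} \tag{164}$$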

[149] We are referring to the test of second derivative which is positive at t = V/a and hence t = V/a is a point of local 
minimum. 
Now, for any \(1 < z < z'\) we have (see § 4.3.2 and Eq. 102 in particular):


\[P(1 < z < z') = \int_{1}^{z'} f(z)\, dz \qquad (165)\]


Similarly [where \(w' = w(z')\)]:


= i f(z) dz (166) 
as required.
PE: Explain and justify the use of Problem 2 of § 4 and Eq. 63 in the above proof. 

2. Show that the one-sided Chebyshev inequality is valid for the exponential distribution (see § 4.2.3). 
Answer: From the left hand side of the one-sided Chebyshev inequality (i.e. Eq. 163) we have (with
reference to Eqs. 94, 102 and 93):

\[\begin{aligned}
P(x \ge \mu + a) &= P\left(x \ge \frac{1}{\alpha} + a\right) = \int_{\frac{1}{\alpha} + a}^{\infty} f(x)\, dx = \int_{\frac{1}{\alpha} + a}^{\infty} \alpha e^{-\alpha x}\, dx \\
&= \left[-e^{-\alpha x}\right]_{\frac{1}{\alpha} + a}^{\infty} = e^{-\alpha\left(\frac{1}{\alpha} + a\right)} = e^{-(1 + \alpha a)} = \frac{1}{e^{1 + \alpha a}}
\end{aligned} \qquad (167)\]

Also, from the right hand side of the one-sided Chebyshev inequality we have (with reference to Eq.
94):

\[\frac{V}{a^2 + V} = \frac{1/\alpha^2}{a^2 + (1/\alpha^2)} = \frac{1}{\alpha^2 a^2 + 1} \qquad (168)\]

Now, let us assume (tentatively) the validity of the one-sided Chebyshev inequality to see if it will lead
to a correct result (and hence it is valid) or it will lead to a wrong result (and hence it is invalid), that
is:

\[\begin{aligned}
P(x \ge \mu + a) &\overset{?}{\le} \frac{V}{a^2 + V} && \text{(Eq. 163)} \\
\frac{1}{e^{1 + \alpha a}} &\overset{?}{\le} \frac{1}{\alpha^2 a^2 + 1} && \text{(Eqs. 167 and 168)} \\
e^{1 + \alpha a} &\overset{?}{\ge} \alpha^2 a^2 + 1 \\
e^{1 + z} - z^2 - 1 &\overset{?}{\ge} 0 && (z = \alpha a)
\end{aligned} \qquad (169)\]

Now, \(\alpha > 0\) and \(a > 0\) and hence \(z > 0\). So, the function \((e^{1+z} - z^2 - 1)\) is increasing because its
derivative is positive for all \(z > 0\). Moreover, at \(z = 0\) we have \((e^{1+z} - z^2 - 1) > 0\). Accordingly,
\((e^{1+z} - z^2 - 1) > 0\) for all \(z > 0\) and hence the inequality of Eq. 169 (which is obtained tentatively
from the one-sided Chebyshev inequality) is correct. So, we can conclude that the one-sided Chebyshev
inequality is valid for the exponential distribution, as required.[150]

PE: Can we use the above argument (possibly with some modifications) to show the validity of the 
one-sided Chebyshev inequality for the geometric distribution (see § 4.1.5)? 
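
The algebraic argument of Problem 2 can also be checked numerically. The following C++ sketch (written for illustration; it is not one of the code files referred to in this book) evaluates the two sides of Eq. 163 for the exponential distribution, i.e. Eqs. 167 and 168, over a grid of values. The function names and the grid limits are our own assumptions.

```cpp
#include <cassert>
#include <cmath>

// Left side of Eq. 163 for the exponential distribution (Eq. 167):
// P(x >= mu + a) = exp(-(1 + alpha*a)), where mu = 1/alpha.
double exponentialTail(double alpha, double a) {
    assert(alpha > 0.0 && a > 0.0);
    return std::exp(-(1.0 + alpha * a));
}

// Right side of Eq. 163 (Eq. 168): V / (a^2 + V) with V = 1/alpha^2.
double oneSidedChebyshevBound(double alpha, double a) {
    assert(alpha > 0.0 && a > 0.0);
    const double V = 1.0 / (alpha * alpha);
    return V / (a * a + V);
}

// Returns true if the tail probability never exceeds the bound on a
// (hypothetical) grid 0.1, 0.2, ... up to about 10 for both alpha and a.
bool boundHoldsOnGrid() {
    for (double alpha = 0.1; alpha <= 10.0; alpha += 0.1)
        for (double a = 0.1; a <= 10.0; a += 0.1)
            if (exponentialTail(alpha, a) > oneSidedChebyshevBound(alpha, a))
                return false;
    return true;
}
```

A grid check of this kind does not replace the proof; it only gives a quick sanity test of Eq. 169 for the sampled values.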


6.3.4 Cauchy-Schwarz Inequality 


According to this theorem, if \(x\) and \(y\) are random variables for which \(\mu(xy)\), \(\mu(x^2)\) and \(\mu(y^2)\) do exist
then:


\[[\mu(xy)]^2 \le \mu(x^2)\, \mu(y^2) \qquad (170)\]


Problems 


1. Prove the Cauchy-Schwarz inequality. 
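Answer: We have (assuming \(c \in \mathbb{R}\)):

\[\begin{aligned}
\mu\left[(x - cy)^2\right] &\ge 0 && [(x - cy)^2 \ge 0, \text{ see notes of § 5.1}] \\
\mu(x^2 - 2cxy + c^2 y^2) &\ge 0 \\
\mu(x^2) - 2c\,\mu(xy) + c^2 \mu(y^2) &\ge 0 && \text{(Eqs. 119-124)} \\
\mu(x^2) - 2\left[\frac{\mu(xy)}{\mu(y^2)}\right]\mu(xy) + \left[\frac{\mu(xy)}{\mu(y^2)}\right]^2 \mu(y^2) &\ge 0 && \left(c = \frac{\mu(xy)}{\mu(y^2)} \in \mathbb{R},\ \mu(y^2) > 0\right) \\
\mu(x^2) - 2\frac{[\mu(xy)]^2}{\mu(y^2)} + \frac{[\mu(xy)]^2}{\mu(y^2)} &\ge 0 \\
\mu(x^2) - \frac{[\mu(xy)]^2}{\mu(y^2)} &\ge 0 \\
\mu(x^2)\, \mu(y^2) - [\mu(xy)]^2 &\ge 0 && [\mu(y^2) > 0, \text{ see notes of § 5.1}]
\end{aligned}\]

We note that \(\mu(y^2) > 0\) (i.e. in lines 4 and 7) is justified (as indicated) by the fact that \(y^2\) is positive
(assuming the random variable \(y\) is non-trivial, i.e. it is not identically zero) and hence its mean must
be positive (see the last point in the preamble of § 5.1).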

PE: Fully explain and justify (in words) each step of the above proof. 
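
The inequality of Eq. 170 can be observed on simulated data where the means are replaced by sample averages. The following C++ sketch is only an illustration under our own assumptions: the names `Moments`, `sampleMoments` and `demoMoments` are hypothetical, and the data model (y equal to half of x plus normal noise) is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Sample averages standing for mu(xy), mu(x^2) and mu(y^2).
struct Moments { double mxy, mxx, myy; };

Moments sampleMoments(const std::vector<double>& x, const std::vector<double>& y) {
    assert(!x.empty() && x.size() == y.size());
    Moments m{0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < x.size(); ++i) {
        m.mxy += x[i] * y[i];
        m.mxx += x[i] * x[i];
        m.myy += y[i] * y[i];
    }
    const double n = static_cast<double>(x.size());
    m.mxy /= n;
    m.mxx /= n;
    m.myy /= n;
    return m;
}

// Generates n correlated pairs (an assumed model: y = 0.5*x + noise)
// and returns their sample moments.
Moments demoMoments(std::size_t n, unsigned seed) {
    std::mt19937 engine(seed);
    std::normal_distribution<double> noise(0.0, 1.0);
    std::vector<double> x(n), y(n);
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = noise(engine);
        y[i] = 0.5 * x[i] + noise(engine);
    }
    return sampleMoments(x, y);
}
```

For any data set the returned moments should satisfy mxy*mxy <= mxx*myy, with equality only when one variable is a constant multiple of the other (e.g. x = 1, 2 and y = 2, 4).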


6.4 Iterated Expectations Theorem 


This theorem (which is also known as the law of iterated expectations) was investigated briefly in 
Problem 11 of § 5.1 (and this should be enough for what we need in this book). 


[150] This kind of argument may not look straightforward; however, we can reverse the above derivation.


Chapter 7 
Applications 


In this chapter we present a tiny sample of the applications of probability and probability theory (as well 
as related subjects like counting) in a number of branches and disciplines (e.g. mathematics, physics, 
biology, etc.). In the sections of this chapter, we will investigate these applications mainly in the form of 
solved Problems. However, we should note that these Problems do not necessarily represent real life issues 
and case studies, and hence some (and possibly most) of them are based on hypothetical situations that 
mimic real life issues and situations. Our purpose, after all, is the investigation of probability rather than 
the investigation of these specific branches (like physics or biology). Anyway, even real life case studies are 
usually based on (or associated with) some modeling simplifications, idealistic hypotheses and unrealistic 
assumptions and hence they are, to some degree, idealized and hypothesized. 


7.1 Calculus 


It is well known that simulated probabilistic processes can be used to evaluate definite integrals
numerically (or rather computationally and stochastically). This sort of stochastic integration is especially
useful when integration by analytical methods is difficult or impossible. The idea of this type of numerical
integration is simply based on the fact that a definite integral (in one variable) represents the area under
a curve (representing the integrand)[151] between the two limits of the integral. Hence, if we randomly
select points in a given area (say a rectangle A) that contains the area of the definite integral (say B)
then the probability P of a point being inside B is approximately equal to the number of points inside B
divided by the total number of points generated, i.e. inside A. Therefore, the area B (which is the same as the
value of the definite integral) is equal to A times P, i.e. B = A × P.[152] This sort of probabilistic (or
stochastic) numerical integration in one variable (or 1D) space can be easily generalized and extended to
multi-variable spaces (also see Problem 1).

In the following points we draw attention to some useful remarks:
• A large number of randomly selected points is usually needed to get satisfactory results (and this may
require considerable computing time although this time is usually trivial on modern computers). In fact,
the number of required points for a given problem[153] depends on the required level of accuracy as well
as the dimensionality (e.g. being in one variable or in two variables) of the problem in question.
• The random selection of points is done by using random number generators. The number of required
random number generators is proportional to the dimensionality of the problem. These random number
generators are coupled to produce random point generators (see the note of Problem 1 for more details).
• The advantages of this method of stochastic integration include: ease of implementation, flexibility in
application, and the possibility of being the only possible or practicable method of integration (and hence
it becomes a necessity rather than a choice). The disadvantages of this method include: being an
approximate method, requirement of computational equipment and resources, requirement of programming
skills and resources, possibility of taking considerable computational time (although on modern computing
equipment this is true only in exceptional cases and circumstances).
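
The procedure described above (B = A × P) can be sketched in C++ as follows. This is a minimal illustration and not one of the code files referred to in the Problems; the names `hitOrMiss1D` and `logIntegrand`, the fixed seed and the requirement that the integrand is non-negative and bounded by `yMax` on the interval are our own assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>

// Example integrand ln(x); it is non-negative for x >= 1.
double logIntegrand(double x) { return std::log(x); }

// Estimates the integral of a non-negative f over [x1, x2].
// The rectangle A is [x1, x2] x [0, yMax], where yMax must not be less
// than the maximum of f on the interval (assumed to be known).
double hitOrMiss1D(const std::function<double(double)>& f,
                   double x1, double x2, double yMax,
                   long nPoints, unsigned seed) {
    assert(x1 < x2 && yMax > 0.0 && nPoints > 0);
    std::mt19937 engine(seed);
    // Two coupled generators: one for x and one for y.
    std::uniform_real_distribution<double> randX(x1, x2);
    std::uniform_real_distribution<double> randY(0.0, yMax);
    long inside = 0;  // number of points that fall inside B
    for (long i = 0; i < nPoints; ++i) {
        const double x = randX(engine);
        const double y = randY(engine);
        if (y <= f(x)) ++inside;
    }
    const double areaA = (x2 - x1) * yMax;
    const double P = static_cast<double>(inside) / static_cast<double>(nPoints);
    return areaA * P;  // B = A x P
}
```

For example, calling `hitOrMiss1D(logIntegrand, 1.0, std::exp(1.0), 1.0, 200000, 12345u)` should return a value close to 1, which is the analytical value of the integral of ln(x) between 1 and e.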


Problems 


[151] To be more accurate we should say: the area between a curve and an axis (usually the x axis).

[152] For simplicity, we use A and B to refer to the geometric entities as well as their areas (i.e. “area” is used in two
meanings).

[153] We note that the type of problem (and its size in particular) is the main factor for determining the number of required
points.
1. Explain in more detail the stochastic (or probabilistic) methods used to evaluate definite integrals.
Answer: There are two main methods: 
(a) The method that we outlined above (for the case of 1D) where we randomly sample points inside 
a given rectangle and hence calculate the proportion of the area under the curve of the integrand to 
the total area of the rectangle (from the proportion of the number of points under the curve to the 
total number of points inside the rectangle). The extension of this method to higher dimensions is 
straightforward. This method is used in Problem 2 (for 1D) and Problem 3 (for 2D). 
(b) The method (for the case of 1D) of randomly sampling the integrand (by sampling the interval of 
integration which represents the independent variable) and hence estimating its average value where 
this average can be used in conjunction with the mean value theorem (of calculus) to calculate the 
area under the curve of the integrand which is the same as the area of the rectangle whose base is 
the interval of the integral and whose height is the average value of the integrand (that we estimated 
probabilistically). The extension of this method to higher dimensions is straightforward. This method 
is used in Problem 4 (for 1D) and Problem 5 (for 2D). However, we should note that this method is 
virtually redundant and does not seem to offer a tangible advantage because we can sample the interval 
(i.e. the x-interval in 1D or the xy area in 2D) systematically rather than stochastically. Nevertheless, 
we included this method to demonstrate the usability of stochastic processes in such applications.[154]
Moreover, the coding of the stochastic process could be simpler than the coding of systematic (or 
deterministic) sampling. 
Note: for the method of part (a) the required number of random number generators is n + 1 where n 
represents the number of variables. So, for 1-variable problems we need to couple two random number 
generators (to select points in an area, one of its dimensions represents the independent variable while 
its other dimension represents the dependent variable), for 2-variable problems we need to couple three 
random number generators, and so on. For the method of part (b) the required number of random 
number generators is equal to the number of variables (and hence n random number generators are 
required for n-variable problems). 
PE: Describe in sufficient detail the application of the above two methods in 2D and 3D problems.
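
Method (b) of Problem 1 can be sketched in C++ as follows (for the 1D case). This is only an illustration under our own assumptions and it is not one of the code files referred to in the following Problems; the names `meanValue1D` and `expIntegrand` and the fixed seed are hypothetical choices.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>

// Example integrand exp(x/4).
double expIntegrand(double x) { return std::exp(x / 4.0); }

// Estimates the integral of f over [x1, x2] as
// (length of the interval) x (average of f at randomly sampled points).
double meanValue1D(const std::function<double(double)>& f,
                   double x1, double x2, long nPoints, unsigned seed) {
    assert(x1 < x2 && nPoints > 0);
    std::mt19937 engine(seed);
    // Only one generator is needed (for the independent variable).
    std::uniform_real_distribution<double> randX(x1, x2);
    double sum = 0.0;
    for (long i = 0; i < nPoints; ++i) sum += f(randX(engine));
    const double average = sum / static_cast<double>(nPoints);
    return (x2 - x1) * average;
}
```

For example, `meanValue1D(expIntegrand, 0.0, 4.0, 200000, 2024u)` should return a value close to 4(e − 1), which is the analytical value of the integral of exp(x/4) between 0 and 4. Unlike method (a), no upper bound on the integrand is required.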
2. Write computer codes to integrate the following 1-variable definite integrals numerically using the 
stochastic method of part (a) of Problem 1: 
(a) \(\int_{x_1}^{x_2} \ln x \, dx\) (\(1 \le x_1 < x_2 \le 100\)). (b) \(\int_{x_1}^{x_2} e^{x/4}\, dx\) (\(-10 \le x_1 < x_2 \le 10\)).
Answer: We used C++ programming language to do these definite integrals. The results are similar 
to the results of the analytical solutions. The numerical results generally improve as the number of 
random points increases. 
(a) See Integrate1DM1Log.cpp file. (b) See Integrate1DM1Exp.cpp file.
PE: Do the following:
(a) Comment the Integrate1DM1Log.cpp and Integrate1DM1Exp.cpp codes explaining what each line
is supposed to do.
(b) Calculate stochastically the value of π by calculating the area of a circle of unit radius inscribed
inside a square of area 4 (noting that this is not a calculus problem but it can be solved by a similar
method to the method of stochastic integration which we described already in part a of Problem 1).
3. Write computer codes to integrate the following 2-variable definite integrals numerically using the 
stochastic method of part (a) of Problem 1: 


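(a) \(\int_{y_1}^{y_2}\int_{x_1}^{x_2} x y^2 \, dx\, dy\) (\(0 \le x_1 < x_2 \le 5\) and \(0 \le y_1 < y_2 \le 5\)).

(b) \(\int_{y_1}^{y_2}\int_{x_1}^{x_2} \sin(x)\cosh(y) \, dx\, dy\) (\(0 \le x_1 < x_2 \le \pi/2\) and \(0 \le y_1 < y_2 \le \pi/2\)).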

Answer: We used C++ programming language to do these definite integrals. The results are similar 
to the results of the analytical solutions. The numerical results generally improve as the number of 
random points increases. 


[154] In fact, the main purpose of this method is to show the possibility of achieving some deterministic processes (which is
integration in this case) stochastically by using probabilistic approaches and techniques.
(a) See Integrate2DM1XYY.cpp file. (b) See Integrate2DM1sinXcoshY.cpp file.


PE: Repeat the Problem for the following integrals:[155]


(a) se » @ydedy = (OSa1<a2<5 and O0<y <yo<5). 


(b) \(\int_{y_1}^{y_2}\int_{x_1}^{x_2} \cos(x)\sinh(y) \, dx\, dy\) (\(0 \le x_1 < x_2 \le \pi/2\) and \(0 \le y_1 < y_2 \le \pi/2\)).
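
The extension of method (a) of Problem 1 to two variables can be sketched in C++ as follows, where three random number generators are coupled to select points inside a box. This is a minimal illustration and not one of the code files referred to in this Problem; the function names, the fixed seed and the requirement that the integrand is non-negative and bounded by `zMax` are our own assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>

// Example integrands (both are non-negative on the ranges of Problem 3).
double xyyIntegrand(double x, double y) { return x * y * y; }
double sinXcoshY(double x, double y) { return std::sin(x) * std::cosh(y); }

// Estimates the double integral of a non-negative f over
// [x1, x2] x [y1, y2]. The sampling box has height zMax, which must not
// be less than the maximum of f on the rectangle (assumed to be known).
double hitOrMiss2D(const std::function<double(double, double)>& f,
                   double x1, double x2, double y1, double y2, double zMax,
                   long nPoints, unsigned seed) {
    assert(x1 < x2 && y1 < y2 && zMax > 0.0 && nPoints > 0);
    std::mt19937 engine(seed);
    // Three coupled generators: x, y and the dependent variable z.
    std::uniform_real_distribution<double> randX(x1, x2);
    std::uniform_real_distribution<double> randY(y1, y2);
    std::uniform_real_distribution<double> randZ(0.0, zMax);
    long inside = 0;  // number of points under the surface z = f(x, y)
    for (long i = 0; i < nPoints; ++i) {
        const double x = randX(engine);
        const double y = randY(engine);
        const double z = randZ(engine);
        if (z <= f(x, y)) ++inside;
    }
    const double boxVolume = (x2 - x1) * (y2 - y1) * zMax;
    return boxVolume * static_cast<double>(inside) / static_cast<double>(nPoints);
}
```

For example, `hitOrMiss2D(xyyIntegrand, 0.0, 1.0, 0.0, 2.0, 4.0, 400000, 7u)` should return a value close to 4/3, which is the analytical value of the integral of x y² over 0 ≤ x ≤ 1 and 0 ≤ y ≤ 2.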


4. Repeat Problem 2 using this time the stochastic method of part (b) of Problem 1. 
Answer: We used C++ programming language to do these definite integrals. The results are similar 
to the results of the analytical solutions. The numerical results generally improve as the number of 
random points increases. 
(a) See Integrate1DM2Log.cpp file. (b) See Integrate1DM2Exp.cpp file.
PE: Comment the Integrate1DM2Log.cpp and Integrate1DM2Exp.cpp codes explaining what each line
is supposed to do.

5. Repeat Problem 3 using this time the stochastic method of part (b) of Problem 1. 
Answer: We used C++ programming language to do these definite integrals. The results are similar 
to the results of the analytical solutions. The numerical results generally improve as the number of 
random points increases. 
(a) See Integrate2DM2XYY.cpp file. (b) See Integrate2DM2sinXcoshY.cpp file. 
PE: Comment the Integrate2DM2XYY.cpp and Integrate2DM2sinXcoshY.cpp codes explaining what 
each line is supposed to do. 

6. Outline a possible use of stochastic processes in differential calculus. 
Answer: Due to the close relation between integral calculus and differential calculus, some of the 
stochastic methods of integration may be used to solve initial-value differential equations. The most 
direct and “intuitive” method of such use is outlined in the following points (noting that this method 
exploits the integration method of part b of Problem 1): 
• Let us assume we want to solve a differential equation of the form \(dy/dx = f(x)\) with the initial condition
\(y_0 = y(x_0)\) over the interval \([x_0, x_n]\).
• We divide the interval \([x_0, x_n]\) into n sub-intervals.
• We integrate stochastically over the first interval (using the method of part b of Problem 1) and add
the value of this integral (say \(\delta y_1\)) to the initial value (i.e. \(y_0\)), and hence we obtain the solution at
\(x_1\), i.e. \(y_1(x_1) = y_0 + \delta y_1\).
• We take \(y_1\) as the new initial value and integrate over the second interval and hence we obtain the
solution at \(x_2\), i.e. \(y_2(x_2) = y_1 + \delta y_2\).
• We continue this process until we obtain the solution at \(x_n\), i.e. \(y_n(x_n) = y_{n-1} + \delta y_n\).
However, we note on this method the following: 
* The differential equations that lend themselves to solution by this method are usually of very simple 
form (normally linear first order). However, the method may be elaborated to tackle differential 
equations of more difficult and elaborate forms. Moreover, it can be useful (like other numerical 
methods) for solving differential equations which are difficult or impossible to solve analytically (even 
though they may be of simple form). 
* Because this method uses the method of integration of part (b) of Problem 1, it faces the same 
criticism as the criticism indicated in that Problem, i.e. it is virtually redundant and does not seem 
to offer a tangible advantage over systematic sampling of points or over other numerical methods. 
However, it is generally simpler in coding in comparison to comparable numerical methods (e.g. finite 
difference) and possibly even to systematic sampling, and this could be an advantage that can justify 
its use even in simple cases where other methods are available and viable. 
Anyway, we demonstrate the application of this method in the next Problem (mainly for the purpose 
of demonstrating the use and applicability of stochastic methods in differential calculus) despite its 
triviality. 
PE: Can we use the method of integration of part (a) of Problem 1 to solve this type of differential 


[155] This kind of PE can be easily done by modifying the provided codes.
equations?
7. Solve the following initial-value differential equation using the method of Problem 6: 


d 
i = ge 2/? with y(z =0) =-1 (for 0 < a < 10) 
xv 
Answer: See StochasticDifferential.cpp file. 
PE: Comment the StochasticDifferential.cpp code explaining what each line is supposed to do. 
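
The method of Problem 6 can be sketched in C++ as follows. This is only an illustration under our own assumptions and it is not the StochasticDifferential.cpp code; in particular, the right hand side cos(x), the function names and the fixed seed are hypothetical choices (with cos(x) and y(0) = 0 the exact solution is sin(x)).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Example right hand side f(x) = cos(x) (an arbitrary choice).
double cosRhs(double x) { return std::cos(x); }

// Solves dy/dx = f(x) with y(x0) = y0 over [x0, xn] by dividing the
// interval into nIntervals sub-intervals and integrating stochastically
// (method b of Problem 1) over each of them. Element i of the returned
// vector is the solution at x0 + i*h.
std::vector<double> solveStochastic(const std::function<double(double)>& f,
                                    double x0, double xn, double y0,
                                    int nIntervals, long samplesPerInterval,
                                    unsigned seed) {
    assert(x0 < xn && nIntervals > 0 && samplesPerInterval > 0);
    std::mt19937 engine(seed);
    const double h = (xn - x0) / nIntervals;  // length of each sub-interval
    std::vector<double> y(static_cast<std::size_t>(nIntervals) + 1);
    y[0] = y0;
    for (int i = 0; i < nIntervals; ++i) {
        const double left = x0 + i * h;
        std::uniform_real_distribution<double> randX(left, left + h);
        double sum = 0.0;
        for (long j = 0; j < samplesPerInterval; ++j) sum += f(randX(engine));
        // Increment = (sub-interval length) x (average of f on it).
        const double dy = h * sum / static_cast<double>(samplesPerInterval);
        y[i + 1] = y[i] + dy;  // previous value acts as the new initial value
    }
    return y;
}
```

For example, solving with `cosRhs` from 0 to π/2 with y(0) = 0 should give a last element close to 1. Each entry of the returned vector accumulates the sampling errors of the preceding sub-intervals, so the accuracy depends on both the number of sub-intervals and the number of samples per sub-interval.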


7.2 Physics 


Random events and processes are everywhere in Nature and hence the probability theory has many 
applications in physics (and physical sciences in general). In fact, there are certain branches of physics 
that are fundamentally based on probability such as statistical mechanics and quantum mechanics (noting 
the different nature of reliance on the probability theory in these two branches). 


Problems 


1. What does “phase space” mean in statistical mechanics?
Answer: It means (in the terminology of probability theory) sample space. 
PE: Mention other concepts of probabilistic nature in statistical mechanics. 

2. How are probabilities in quantum mechanics expressed and quantified?
Answer:[156] If a quantum object (say an electron) is in a state represented by the (normalized)
wavefunction \(\psi(\mathbf{r}, t)\) then \(|\psi|^2 d^3r\) represents the probability of finding the object in the infinitesimal
volume element \(d^3r\) around the position \(\mathbf{r}\) at time \(t\). Accordingly, if we integrate the probability density
\(|\psi|^2\) over a patch of space we get the probability of finding the object in that patch at time \(t\) (and
hence the integral should equal unity if the patch represents the entire space).
PE: Compare the probability in quantum mechanics with the probability in statistical mechanics. 

3. Give an example of a continuous probability function (i.e. density function) that is commonly used in 
physics to model the distribution of certain properties of gases. 
Answer: It is the Maxwell-Boltzmann distribution which is given by: 


2m mM 9 _mve 
— ee ~~ 2kT ee 
EEN OnkT Us, @ RT (0 < vez < co) 


f (vz) 


where v, is the speed of the gas molecules in the x direction, m is the mass of the gas molecules, T is 
the temperature of the gas and k is the Boltzmann constant. 

PE: Show that the Maxwell-Boltzmann distribution (as given by the above equation) satisfies the 
conditions of probability density functions, i.e. Eqs. 86 and 87. 

4. Referring to Problem 34 of § 2.2, find the probabilities of occupancy in the three cases (i.e. what is
the probability of any given possible occupancy in each one of the three cases).
Answer:
(a) For Maxwell-Boltzmann statistics, the probability of each possible occupancy is \(1/k^n\) because
we have \(k^n\) equally likely possibilities for occupancy (see part a of Problem 34 of § 2.2).
(b) For Fermi-Dirac statistics, the probability of each possible occupancy is \(1/C^{k}_{n}\) because we have
\(C^{k}_{n}\) equally likely possibilities for occupancy (see part b of Problem 34 of § 2.2).
(c) For Bose-Einstein statistics, the probability of each possible occupancy is \(1/C^{n+k-1}_{n}\) because we
have \(C^{n+k-1}_{n}\) equally likely possibilities for occupancy (see part c of Problem 34 of § 2.2).

PE: Can we order these probabilities (i.e. by using inequalities)? If so, order them increasingly. 

5. In quantum mechanics, the degeneracy in 3D simple harmonic oscillator requires calculating the number 
of ordered triplets of non-negative integers \(n_1, n_2, n_3\) restricted by the condition that \(n_1 + n_2 + n_3 = n\)
with n being a positive integer. Find a formula for the number of such triplets and give some examples
of such triplets.


[156] This answer represents just an example.
Answer: Referring to part (c) of Problem 34 of § 2.2, we can consider “ordered triplet” as 3 states
and consider n as the number of indistinguishable particles and hence we can use the Bose-Einstein
statistics with k = 3. Hence, the number of such triplets is \(C^{n+k-1}_{n} = C^{n+3-1}_{n} = C^{n+2}_{n}\). For example:


• If n = 1 then we have \(C^{n+2}_{n} = C^{3}_{1} = 3\) triplets which are:

(1, 0, 0) (0, 1, 0) (0, 0, 1)
• If n = 2 then we have \(C^{n+2}_{n} = C^{4}_{2} = 6\) triplets which are:

(1, 1, 0) (1, 0, 1) (0, 1, 1) (2, 0, 0) (0, 2, 0) (0, 0, 2)
• If n = 3 then we have \(C^{n+2}_{n} = C^{5}_{3} = 10\) triplets which are:

(1, 1, 1) (2, 1, 0) (2, 0, 1) (1, 2, 0) (0, 2, 1)
(1, 0, 2) (0, 1, 2) (3, 0, 0) (0, 3, 0) (0, 0, 3)
• If n = 4 then we have \(C^{n+2}_{n} = C^{6}_{4} = 15\) triplets which are:

(2, 1, 1) (1, 2, 1) (1, 1, 2) (2, 2, 0) (2, 0, 2) (0, 2, 2) (3, 1, 0) (3, 0, 1)
(1, 3, 0) (0, 3, 1) (1, 0, 3) (0, 1, 3) (4, 0, 0) (0, 4, 0) (0, 0, 4)


PE: Find all the triplets for n = 5. 
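
The counting formula of Problem 5 can be checked by direct enumeration. The following C++ sketch is an illustration under our own assumptions (the function names are hypothetical): it lists all the ordered triplets for a given n and compares their number with \(C^{n+2}_{n} = (n+2)(n+1)/2\).

```cpp
#include <array>
#include <cassert>
#include <vector>

// Lists all ordered triplets (n1, n2, n3) of non-negative integers
// such that n1 + n2 + n3 = n.
std::vector<std::array<int, 3>> orderedTriplets(int n) {
    assert(n >= 0);
    std::vector<std::array<int, 3>> result;
    for (int n1 = 0; n1 <= n; ++n1)
        for (int n2 = 0; n2 <= n - n1; ++n2)
            result.push_back(std::array<int, 3>{{n1, n2, n - n1 - n2}});
    return result;
}

// Number of triplets according to the formula C^{n+2}_n = (n+2)(n+1)/2.
long tripletCountFormula(int n) {
    assert(n >= 0);
    return static_cast<long>(n + 2) * (n + 1) / 2;
}
```

The size of the vector returned by `orderedTriplets(n)` should agree with `tripletCountFormula(n)` for any n; the order in which the triplets are listed is different from the order used in the above answer.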

6. In a quantum physics (or particle physics) experiment a source of emission is placed at S (see Figure 29)
where at any non-terminal node (or point or junction) the emitted particles reaching that node (including
S where the particles are emitted) can go randomly (with equal probability) in any one of the available
one-way guided tracks that branch from that node (as shown in Figure 29). What is the probability
that an emitted particle can reach the terminal (or destination) points \(D_1, D_2, D_3, D_4, D_5, D_6, D_7\)?
Answer: If \(P(s_1)\) symbolizes the probability of reaching node \(s_1\) and \(P(D_1|s_1)\) symbolizes the
probability of reaching node \(D_1\) given that the particle reached point \(s_1\) (and similar symbols apply to the
other s and D nodes) then we have (noting that all other probabilities are zero):


(a) \(P(s_1) = 1/4\). (b) \(P(s_2) = 1/4\). (c) \(P(s_3) = 1/4\). (d) \(P(s_4) = 1/4\).
(e) \(P(D_1|s_1) = 1\). (f) \(P(D_2|s_2) = 1/3\). (g) \(P(D_3|s_2) = 1/3\). (h) \(P(D_4|s_2) = 1/3\).
(i) \(P(D_4|s_3) = 1/2\). (j) \(P(D_5|s_3) = 1/2\). (k) \(P(D_6|s_4) = 1/2\). (l) \(P(D_7|s_4) = 1/2\).


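Now, if we note that reaching \(s_1, s_2, s_3, s_4\) are mutually exclusive events whose union represents the
entire sample space (and hence we can use Eq. 150) then we have:[157]

\[\begin{aligned}
P(D_1) &= \sum_{i=1}^{4} P(s_i) P(D_1|s_i) = \left(\frac{1}{4} \times 1\right) + 0 + 0 + 0 = \frac{1}{4} \\
P(D_2) &= \sum_{i=1}^{4} P(s_i) P(D_2|s_i) = 0 + \left(\frac{1}{4} \times \frac{1}{3}\right) + 0 + 0 = \frac{1}{12} \\
P(D_3) &= \sum_{i=1}^{4} P(s_i) P(D_3|s_i) = 0 + \left(\frac{1}{4} \times \frac{1}{3}\right) + 0 + 0 = \frac{1}{12} \\
P(D_4) &= \sum_{i=1}^{4} P(s_i) P(D_4|s_i) = 0 + \left(\frac{1}{4} \times \frac{1}{3}\right) + \left(\frac{1}{4} \times \frac{1}{2}\right) + 0 = \frac{5}{24} \\
P(D_5) &= \sum_{i=1}^{4} P(s_i) P(D_5|s_i) = 0 + 0 + \left(\frac{1}{4} \times \frac{1}{2}\right) + 0 = \frac{1}{8} \\
P(D_6) &= \sum_{i=1}^{4} P(s_i) P(D_6|s_i) = 0 + 0 + 0 + \left(\frac{1}{4} \times \frac{1}{2}\right) = \frac{1}{8} \\
P(D_7) &= \sum_{i=1}^{4} P(s_i) P(D_7|s_i) = 0 + 0 + 0 + \left(\frac{1}{4} \times \frac{1}{2}\right) = \frac{1}{8}
\end{aligned}\]


[157] As before, \(P(D_j)\) is the probability of reaching the terminal node \(D_j\) where j = 1, 2, ..., 7.
Figure 29: The schematic of Problem 6 of § 7.2. The filled circles represent nodes (or junctions) while
the directed lines represent the one way tracks (noting that at any non-terminal node the probability of
going through any available one-way track originating from that node is the same).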


PE: Should we have \(\sum_{j=1}^{7} P(D_j) = 1\)? Can we consider this as a (partial) check for the validity of
the obtained results?
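
The calculations of Problem 6 (which are based on Eq. 150) can be repeated by a short C++ code. The following sketch is only an illustration: the function name is hypothetical and the table of conditional probabilities is copied from points (a)-(l) of the above answer (not from Figure 29 directly).

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Returns P(D_1), ..., P(D_7) (as elements 0 to 6) using
// P(D_j) = sum over i of P(s_i) P(D_j | s_i).
std::array<double, 7> terminalProbabilities() {
    const double pS[4] = {0.25, 0.25, 0.25, 0.25};  // P(s_1) to P(s_4)
    // cond[i][j] = P(D_{j+1} | s_{i+1}), from points (e)-(l) of the answer.
    const double cond[4][7] = {
        {1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0},
        {0.0, 1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0, 0.0, 0.0, 0.0},
        {0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0},
        {0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5}};
    std::array<double, 7> pD{};  // initialized to zero
    for (int i = 0; i < 4; ++i) {
        double rowSum = 0.0;
        for (int j = 0; j < 7; ++j) {
            pD[j] += pS[i] * cond[i][j];
            rowSum += cond[i][j];
        }
        // Each row of conditional probabilities must add up to 1.
        assert(std::fabs(rowSum - 1.0) < 1e-12);
    }
    return pD;
}
```

The seven returned values should add up to 1 (within rounding), which is the partial check suggested in the above PE.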

7. In quantum physics, the wavefunction \(\psi\) of an electron in the 1s orbital in a hydrogen atom is given
by \(\psi = A e^{-r/a_0}\) where \(A\) is a constant, \(r\) is the radial distance from the center of the atom and \(a_0\) is
the Bohr radius. Find the following (for this 1s electron):

(a) The constant A. (b) Its mean distance from the center. (c) The variance of its distance.
Answer:

(a) The 1s orbital is spherically symmetric and hence it has only radial dependency. In quantum
mechanics, the probability density function is given by \(|\psi|^2 = \psi^* \psi\). By the normalization condition of
probability, the integral of the probability density function over the entire space should be unity, that
is (where \(d\tau\) is an infinitesimal volume element):

\[\begin{aligned}
\int_{\text{all space}} \psi^* \psi \, d\tau &= 1 \\
\int_{\text{all space}} A^2 e^{-2r/a_0} \, d\tau &= 1 \\
\int_{r=0}^{+\infty} A^2 e^{-2r/a_0}\, 4\pi r^2 \, dr &= 1 \\
4\pi A^2 \int_{0}^{\infty} e^{-2r/a_0}\, r^2 \, dr &= 1 \\
4\pi A^2 \times \frac{a_0^3}{4} &= 1 \\
A &= \frac{1}{\sqrt{\pi a_0^3}}
\end{aligned}\]

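(b) The mean distance is given by (see Eq. 116):[158]

\[\mu(r) = \int_{0}^{\infty} r f(r)\, dr = \int_{0}^{\infty} r\, \frac{4r^2}{a_0^3} e^{-2r/a_0}\, dr = \frac{4}{a_0^3} \int_{0}^{\infty} r^3 e^{-2r/a_0}\, dr = \frac{4}{a_0^3} \times \frac{3a_0^4}{8} = \frac{3a_0}{2}\]

(c) The variance is given by (see Eq. 138):

\[\begin{aligned}
V(r) &= \int_{0}^{\infty} (r - \mu)^2\, \frac{4r^2}{a_0^3} e^{-2r/a_0}\, dr = \frac{4}{a_0^3} \int_{0}^{\infty} \left(r^4 - 2\mu r^3 + \mu^2 r^2\right) e^{-2r/a_0}\, dr \\
&= \frac{4}{a_0^3} \int_{0}^{\infty} \left(r^4 - 3a_0 r^3 + \frac{9a_0^2}{4} r^2\right) e^{-2r/a_0}\, dr = \frac{4}{a_0^3} \times \frac{3a_0^5}{16} = \frac{3a_0^2}{4}
\end{aligned}\]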


PE: Do the following: 

(a) Justify each step of the above derivations. 

(b) Calculate the numeric value of A, u(r) and V(r). 
(c) Find \(\sigma(r)\) and calculate its numeric value.
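
The analytical results of Problem 7 can be verified numerically. The following C++ sketch is an illustration under our own assumptions: it integrates the radial density \(f(r) = (4r^2/a_0^3)\, e^{-2r/a_0}\) by the (deterministic) Simpson rule, truncates the infinite upper limit at \(40 a_0\) and uses hypothetical names (`simpson`, `radialDensity`, `radialStats`).

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Composite Simpson rule with n (even) sub-intervals on [lo, hi].
double simpson(const std::function<double(double)>& g, double lo, double hi, int n) {
    assert(lo < hi && n > 0 && n % 2 == 0);
    const double h = (hi - lo) / n;
    double s = g(lo) + g(hi);
    for (int i = 1; i < n; ++i) s += g(lo + i * h) * (i % 2 == 1 ? 4.0 : 2.0);
    return s * h / 3.0;
}

// Radial density of the 1s electron: f(r) = (4 r^2 / a0^3) exp(-2r/a0).
double radialDensity(double r, double a0) {
    return 4.0 * r * r / (a0 * a0 * a0) * std::exp(-2.0 * r / a0);
}

struct RadialStats { double norm, mean, variance; };

// Integral of f (should be 1), mean of r and variance of r.
RadialStats radialStats(double a0) {
    assert(a0 > 0.0);
    const double rMax = 40.0 * a0;  // assumed cut-off (the tail is negligible)
    const int n = 4000;
    const double norm = simpson([a0](double r) { return radialDensity(r, a0); }, 0.0, rMax, n);
    const double mean = simpson([a0](double r) { return r * radialDensity(r, a0); }, 0.0, rMax, n);
    const double second = simpson([a0](double r) { return r * r * radialDensity(r, a0); }, 0.0, rMax, n);
    return RadialStats{norm, mean, second - mean * mean};
}
```

For example, `radialStats(1.0)` (i.e. distances measured in units of the Bohr radius) should give values close to 1, 3/2 and 3/4 for the normalization integral, the mean and the variance, in agreement with parts (a), (b) and (c).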


7.3 Biology 


One of the best known examples of the use of probability theory in biological sciences is in genetics. 
Probability theory (and related branches) is also essential in the quantitative investigation of infectious 
diseases, epidemics, pandemics and so on. In fact, it underlies almost all the quantitative investigations 
related to health and medical issues noting that these types of investigation generally rely on statistical 
data and they lack reliable analytical models (unlike physical sciences, to some degree, for instance) and 
hence they heavily rely on probability and statistics. 


Problems 


1. Give an example of the “use of probability” by animals. 
Answer: The principle of “safety in numbers” which many social animals use to avoid predators (or 
rather reduce their chance of being caught by predators) is an example of the “use of probability” by 
animals. 
PE: Give other examples of the “use of probability” by animals (e.g. in mating and breeding habits of 
some animals or living beings in general). Also give some examples of the use of probability (consciously 
and unconsciously) in our daily life. 

2. A virus test produces a positive result in 98.23% of the cases of infection (i.e. the result is correct). It
also produces a positive result in 0.09% of the cases of non-infection (i.e. the result is incorrect).[159]
If this test is conducted on a person picked randomly from a population with 0.013% infection rate
and it tests positive, what is the chance of this person being really infected?
Answer: Let us adopt the following:
• A is the event that “the person is really infected”.
• B is the event that “the test is positive”.
The required probability is \(P(A|B)\), i.e. the probability that the person is really infected given that
the test is positive. From the given information we have:


\[P(A) = 0.00013 \qquad P(\bar{A}) = 0.99987 \qquad P(B|A) = 0.9823 \qquad P(B|\bar{A}) = 0.0009\]


[158] It should be noted that: \(|\psi|^2\) is the probability density function with regard to space (i.e. corresponding to \(d^3r\)) and
\(\frac{4r^2}{a_0^3} e^{-2r/a_0}\) is the probability density function with regard to radial distance (i.e. corresponding to \(dr\)).
[159] In brief, if we assume that the test produces only positive and negative results (i.e. there is no possibility of indeterminate
results) then we can say: in the cases of infection it produces 98.23% correct positive results and 1.77% incorrect negative
results, while in the cases of non-infection it produces 99.91% correct negative results and 0.09% incorrect positive results.
On substituting these values in Eq. 152 we get:


\[P(A|B) = \frac{P(A) P(B|A)}{P(A) P(B|A) + P(\bar{A}) P(B|\bar{A})} = \frac{0.00013 \times 0.9823}{0.00013 \times 0.9823 + 0.99987 \times 0.0009} \simeq 0.124271348\]


This result may look odd because we have 98.23% success rate (i.e. when the test is positive). However, 
this oddity should disappear if we notice that the actual infection rate is 0.013% which is very small 
and this should drive the probability down to this low level. In loose terms, if the infection rate of 
population is very small then the probability (before test) of this person being infected is very small 
and hence even if the probability of correctness of test is very high the final probability [i.e. the
probability (after positive test) of this person being really infected given that the infection rate is 
very small] should be relatively small because this final probability is the result of coupling these two 
probabilities where the high probability of correctness of test is moderated by the low probability of 
this person (i.e. prior to test and as a member of population) being infected. 

Anyway, let us verify (partially) this result. We have four possible cases for the result of the test:

• The test is positive and the person is really infected. The probability in this case is \(P(A|B)\).

• The test is positive and the person is not really infected. The probability in this case is \(P(\bar{A}|B)\).

• The test is negative and the person is really infected. The probability in this case is \(P(A|\bar{B})\).

• The test is negative and the person is not really infected. The probability in this case is \(P(\bar{A}|\bar{B})\).
Now, the probabilities of the first two cases should add up to unity (because if the test is positive then
either the person is infected or not). Similarly, the probabilities of the last two cases should add up
to unity (because if the test is negative then either the person is infected or not). If we calculate the
probabilities of the last three cases as we did already for the first case (using Eq. 152) then we get
[noting that \(P(\bar{B}|A) = 0.0177\) and \(P(\bar{B}|\bar{A}) = 0.9991\)]:

\[\begin{aligned}
P(\bar{A}|B) &= \frac{P(\bar{A}) P(B|\bar{A})}{P(\bar{A}) P(B|\bar{A}) + P(A) P(B|A)} = \frac{0.99987 \times 0.0009}{0.99987 \times 0.0009 + 0.00013 \times 0.9823} \simeq 0.875728652 \\
P(A|\bar{B}) &= \frac{P(A) P(\bar{B}|A)}{P(A) P(\bar{B}|A) + P(\bar{A}) P(\bar{B}|\bar{A})} = \frac{0.00013 \times 0.0177}{0.00013 \times 0.0177 + 0.99987 \times 0.9991} \simeq 0.000002303 \\
P(\bar{A}|\bar{B}) &= \frac{P(\bar{A}) P(\bar{B}|\bar{A})}{P(\bar{A}) P(\bar{B}|\bar{A}) + P(A) P(\bar{B}|A)} = \frac{0.99987 \times 0.9991}{0.99987 \times 0.9991 + 0.00013 \times 0.0177} \simeq 0.999997697
\end{aligned}\]

As we see, \(P(A|B) + P(\bar{A}|B) = 1\) and \(P(A|\bar{B}) + P(\bar{A}|\bar{B}) = 1\).

Note: the very low \(P(A|\bar{B})\) should endorse and clarify the rationale of our argument (which we gave
in the answer) further because the very low probability of being infected is pushed down further by
the highly reliable negative test which implies that the chance of being infected is very low. Similarly,
the very high \(P(\bar{A}|\bar{B})\) should endorse and clarify the rationale of our argument further because the
very high probability of being non-infected is pushed up further by the highly reliable negative test
which implies that the chance of being non-infected is very high.

PE: Why did we use the Bayes theorem in the form of Eq. 152 instead of the form of Eq. 149?
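
The calculations of Problem 2 are simple enough to be coded in a few lines. The following C++ sketch is an illustration under our own assumptions (the function name `posterior` and its parameter names are hypothetical); it implements the two-event form of the Bayes theorem that is used in the above answer.

```cpp
#include <cassert>

// Two-event Bayes theorem:
// P(H|E) = P(H) P(E|H) / [P(H) P(E|H) + P(not H) P(E|not H)]
// prior            : P(H)
// likelihood       : P(E|H)
// likelihoodGivenNot: P(E|not H)
double posterior(double prior, double likelihood, double likelihoodGivenNot) {
    assert(prior >= 0.0 && prior <= 1.0);
    assert(likelihood >= 0.0 && likelihood <= 1.0);
    assert(likelihoodGivenNot >= 0.0 && likelihoodGivenNot <= 1.0);
    const double numerator = prior * likelihood;
    const double denominator = numerator + (1.0 - prior) * likelihoodGivenNot;
    assert(denominator > 0.0);  // the evidence must have non-zero probability
    return numerator / denominator;
}
```

For example, `posterior(0.00013, 0.9823, 0.0009)` corresponds to \(P(A|B)\) of the above answer, while `posterior(0.00013, 0.0177, 0.9991)` corresponds to \(P(A|\bar{B})\). Trying larger values of the prior (i.e. of the infection rate) shows how strongly the result depends on it.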


7.4 Gambling 


Gambling is entirely based on the theory of probability. Historically, the emergence of this theory (in 
its mathematical form) was largely as a response to the demand of gamblers to make good predictions 
and decisions (and hence increase their chances of winning). Therefore, we can consider gambling as the 
birthplace of the probability theory. 


Problems 


1. A gambler participated in a game of lotto where the player chooses five numbers from the numbers 
1,2,...,30 and he wins if his numbers hit the jackpot. If the price of ticket is $1 and the prize money 
is $100000, is he a winner or a loser (i.e. statistically)? Assume in your answer that only one player
can participate at any game.

Answer: The number of combinations for choosing 5 numbers out of 30 is \(C^{30}_{5} = 142506\). This means
that on a statistical (or probabilistic) basis he needs to play 142506 times (i.e. by playing all the
possible combinations) to win once. In other words, he needs to spend $142506 (by buying 142506
tickets) to win $100000 (i.e. the prize money). So, on a statistical basis he is a loser because he
invests $142506 to get a return of $100000.

Warning to gamblers: it is noteworthy that for the “statistical basis” to have an effect the gambler
should play many times (say tens of thousands) which is usually not achievable. So, if he plays one
time (or a few times) in each lottery draw he is generally a loser even if the prize money exceeds
$142506.[160] Yes, if he plays many times (i.e. tens of thousands by buying tickets of different numbers)
in each draw then he should generally be a winner if the prize money exceeds $142506. However, no
lotto offers such a high prize money (because the lotto organizer will lose). Moreover, buying too
many tickets in each draw is not a practical possibility (at least for the overwhelming majority of
gamblers). In fact, we also ignored another important factor that we indicated in the question, i.e.
in an ordinary lottery many people participate and hence the jackpot can be won by more than one
participant and hence any winner will take only part of the prize money. So, our advice is to avoid all
kinds of gambling even when they look profitable.[161]

PE: Find the break-even prize money for a problem similar to the present Problem but assume this 
time that six numbers are selected from the numbers 1,2,...,40 and the price of the ticket is $2. 
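
The argument of Problem 1 can be expressed in terms of the expected net gain per ticket, i.e. (probability of winning) × (prize money) − (price of ticket). The following C++ sketch is an illustration under our own assumptions (the function names are hypothetical and, as in the Problem, only one player and one prize are considered).

```cpp
#include <cassert>

// Number of combinations C^n_r (choosing r items out of n), calculated
// by exact integer arithmetic.
unsigned long long combinations(unsigned n, unsigned r) {
    assert(r <= n);
    if (r > n - r) r = n - r;  // use the symmetry C^n_r = C^n_{n-r}
    unsigned long long c = 1;
    for (unsigned i = 1; i <= r; ++i)
        c = c * (n - r + i) / i;  // the division is exact at every step
    return c;
}

// Expected net gain per ticket when r numbers are chosen out of n.
double expectedNetGain(unsigned n, unsigned r, double ticketPrice, double prize) {
    assert(ticketPrice > 0.0 && prize >= 0.0);
    const double pWin = 1.0 / static_cast<double>(combinations(n, r));
    return pWin * prize - ticketPrice;
}
```

For the data of Problem 1, `expectedNetGain(30, 5, 1.0, 100000.0)` is negative (the gambler is statistically a loser), and it becomes zero only when the prize money is equal to the number of combinations times the price of the ticket.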

2. The participant in the game of lotto in the United Kingdom chooses 6 numbers from 1-59. Winning and 
losing (and hence prizes) are determined by the number of winning numbers that he gets in his choice 
(out of 6 subsequently-drawn winning numbers). What are the probabilities of getting 0,1, 2,3,4,5,6 
winning numbers in his choice? 

Answer: This is an example of the hypergeometric distribution (see § 4.1.6) where N = 59, n = 6, 
R = 6 and r = 0, 1, 2, 3, 4, 5, 6. Accordingly (see Eq. 84): 


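\begin{align*}
P(59,6,6,0) &= C^{6}_{0}\, C^{59-6}_{6-0} / C^{59}_{6} \simeq 0.509515469 \\
P(59,6,6,1) &= C^{6}_{1}\, C^{59-6}_{6-1} / C^{59}_{6} \simeq 0.382136602 \\
P(59,6,6,2) &= C^{6}_{2}\, C^{59-6}_{6-2} / C^{59}_{6} \simeq 0.097483827 \\
P(59,6,6,3) &= C^{6}_{3}\, C^{59-6}_{6-3} / C^{59}_{6} \simeq 0.010398275 \\
P(59,6,6,4) &= C^{6}_{4}\, C^{59-6}_{6-4} / C^{59}_{6} \simeq 0.000458747 \\
P(59,6,6,5) &= C^{6}_{5}\, C^{59-6}_{6-5} / C^{59}_{6} \simeq 7.05765 \times 10^{-6} \\
P(59,6,6,6) &= C^{6}_{6}\, C^{59-6}_{6-6} / C^{59}_{6} \simeq 2.21939 \times 10^{-8}
\end{align*}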


As we see, these probabilities add up to 1 as they should (and hence this is a partial check). 
PE: Repeat the Problem by assuming N = 55, n = 5, R = 5 and r = 0, 1, 2, 3, 4, 5. 


7.5 Business 


Probability theory plays a central role in many types of business activities. For example, probability 
theory is essential in the insurance industry to make reliable predictions (i.e. of statistical nature) about 
damage or destruction or loss or death, for instance, and hence assess the chance of making profit or loss. 
This assessment is essential to the insurers for creating and tailoring their insurance policies and coverage 
packages and any other type of product they offer to their clients. Also, entrepreneurs should consider many 
probabilistic factors in the business models of their projects and in the setting up and arrangement of their 
new enterprises and ventures. 


[160] Some may disagree with this, but we think it is logical (considering the rationale of statistics). 
[161] This advice is not only because of potential moral considerations but also because of certain pragmatic considerations. 
Gambling destroys families and brings sorrow, poverty and disasters. 
7.6 Finance 


The stock and foreign exchange markets are strongly affected by trends and factors of probabilistic nature 
and hence probability (and its theory) has a strong presence in the considerations of (and the decisions 
made by) banks, companies, brokers and individual investors. In fact, there are many examples of the 
role and significance of probability and probability theory in most types of financial activities. 


7.7 Public Opinion and Trends 


Probability theory is central to the disciplines and activities related to gauging public opinions and trends 
(e.g. in politics or marketing or advertising) to make reliable predictions and accurate projections. 
Accordingly, probability theory is important to political parties (or rather their strategists), pollsters and 
advertisers among many other professionals and professions of these types. 


7.8 Meteorology 


The discipline of meteorology is based on many physical factors some of which are deterministic while 
others are probabilistic or subject to fluctuations and uncertainties. The weather forecast, for instance, 
is fundamentally based on many factors and effects of stochastic nature and hence probability theory is 
used (partly) to make reliable predictions about the weather conditions and determine uncertainties and 
margins of error. 


7.9 Social and Political Sciences 


As social and political sciences are generally related to human behavior and activities (which are not 
totally deterministic), probability and its mathematical theory play an important role in these disciplines. 
For example, patterns of migration, trends of cultural development, political instabilities, social upheavals 
and wars can be investigated quantitatively (and partly) by probability theory. 


7.10 Economics 


The economy is based on many stochastic factors which determine for instance growth, decline, inflation, 
trends of markets (locally, nationally and globally), competitions, etc. Therefore, economists (especially 
those whose decisions have an impact on national and international levels) must take many probabilistic 
factors and considerations into account in their models, decisions, judgments, forecasts, etc. 


7.11 Industry 


Many industrial processes and activities are subject to fluctuations and uncertainties and hence probability 
is essential in the industrial modeling and assessment as well as the expectation of yield and return (e.g. 
whether or not a certain project is profitable considering the availability of raw materials, the demand 
for the final product, the proportion of defective units produced, the risks in transportation of product to 
customers and consumers, etc.). 